# 🧠 JAMAL AI - Similarity Metric Learning for Brainstorming

## Topic: Similarity Metric Learning
**Course**: Artificial Intelligence

### Goal
Build a **Siamese Neural Network** trained with **Contrastive Loss** to measure the semantic similarity between ideas (sticky notes) in a brainstorming application such as FigJam.

### Main Algorithms
1. **Siamese Network** - Twin networks with shared weights
2. **Contrastive Loss** - Loss function for metric learning
3. **Euclidean Distance** - Measures distance in the embedding space
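
As a small worked illustration of how the contrastive loss and Euclidean distance fit together, here is a NumPy-only sketch. The toy embeddings and labels are made-up values for illustration, not outputs of the model in this notebook, and the margin of 1.0 is an assumption matching the configuration cell.

```python
import numpy as np

def euclidean_distance(a, b):
    """Row-wise Euclidean distance between two batches of embeddings."""
    return np.sqrt(np.sum((a - b) ** 2, axis=1))

def contrastive_loss(y, d, margin=1.0):
    """Mean of y * d^2 + (1 - y) * max(margin - d, 0)^2."""
    similar_term = y * d ** 2
    different_term = (1.0 - y) * np.maximum(margin - d, 0.0) ** 2
    return float(np.mean(similar_term + different_term))

# Toy 2-D embeddings (illustrative values only)
emb_a = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
emb_b = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
labels = np.array([1.0, 0.0, 0.0])  # 1 = similar, 0 = different

distances = euclidean_distance(emb_a, emb_b)
loss = contrastive_loss(labels, distances)
print(distances, loss)
```

The first pair is labeled similar and is penalized for being 0.5 apart. The second pair is labeled different and is penalized for sitting inside the margin. The third pair is different and already beyond the margin, so it contributes nothing.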

### Dataset
- **STS Benchmark** (Semantic Textual Similarity) - Standard dataset for similarity tasks
- **Custom Brainstorming Test Data** - Domain-specific test data

In [None]:
# =============================================================================
# CELL 1: SETUP ENVIRONMENT
# =============================================================================

import os
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, Model, Input
import tensorflow.keras.backend as K
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

print(f"TensorFlow Version: {tf.__version__}")
print("✅ Environment Ready!")

In [None]:
# =============================================================================
# CELL 2: MODEL CONFIGURATION
# =============================================================================

# --- HYPERPARAMETERS ---
MAX_VOCAB = 15000        # Vocabulary size
MAX_LEN = 50             # Maximum sequence length (raised for STS)
EMBEDDING_DIM = 128      # Embedding dimension (raised)
LSTM_UNITS = 128         # LSTM units (raised)
DENSE_UNITS = 64         # Output embedding dimension
BATCH_SIZE = 64          # Batch size
EPOCHS = 15              # Training epochs
MARGIN = 1.0             # Margin for the contrastive loss

print("📋 MODEL CONFIGURATION:")
print(f"   MAX_VOCAB: {MAX_VOCAB}")
print(f"   MAX_LEN: {MAX_LEN}")
print(f"   EMBEDDING_DIM: {EMBEDDING_DIM}")
print(f"   LSTM_UNITS: {LSTM_UNITS}")
print(f"   DENSE_UNITS: {DENSE_UNITS}")
print(f"   MARGIN: {MARGIN}")

In [None]:
# =============================================================================
# CELL 3: LOAD STS BENCHMARK DATASET
# =============================================================================

# Install the datasets library if it is not already available
!pip install -q datasets

from datasets import load_dataset

print("🔄 Loading STS Benchmark Dataset...")

# Load STS Benchmark from Hugging Face
try:
    sts_train = load_dataset("mteb/stsbenchmark-sts", split="train")
    sts_test = load_dataset("mteb/stsbenchmark-sts", split="test")

    print(f"✅ Train samples: {len(sts_train)}")
    print(f"✅ Test samples: {len(sts_test)}")

    # Show sample
    print("\n📝 Sample Data:")
    for i in range(3):
        sample = sts_train[i]
        print(f"   S1: {sample['sentence1']}")
        print(f"   S2: {sample['sentence2']}")
        print(f"   Score: {sample['score']:.2f}/5.0")
        print()
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("   Make sure Internet is enabled in Kaggle Settings!")
    raise  # Later cells need sts_train and sts_test, so stop here

In [None]:
# =============================================================================
# CELL 4: PREPROCESSING DATA
# =============================================================================

def prepare_sts_data(dataset, threshold=2.5):
    """
    Convert the STS dataset (scores 0-5) to binary labels.
    Score >= threshold -> Similar (1)
    Score < threshold  -> Different (0)
    """
    sentences1 = []
    sentences2 = []
    labels = []
    scores = []  # Keep original scores for evaluation
    
    for sample in dataset:
        s1 = str(sample['sentence1']).strip()
        s2 = str(sample['sentence2']).strip()
        score = float(sample['score'])
        
        if s1 and s2:  # Skip empty
            sentences1.append(s1)
            sentences2.append(s2)
            scores.append(score)
            # Binary: 1 if score >= threshold (similar), otherwise 0
            labels.append(1.0 if score >= threshold else 0.0)
    
    return np.array(sentences1), np.array(sentences2), np.array(labels), np.array(scores)

# Prepare data
print("🔄 Preparing training data...")
train_s1, train_s2, train_labels, train_scores = prepare_sts_data(sts_train)
test_s1, test_s2, test_labels, test_scores = prepare_sts_data(sts_test)

print("\n📊 DATA STATISTICS:")
print(f"   Training pairs: {len(train_labels)}")
print(f"   Test pairs: {len(test_labels)}")
print(f"   Train - Similar (1): {sum(train_labels):.0f} ({sum(train_labels)/len(train_labels)*100:.1f}%)")
print(f"   Train - Different (0): {len(train_labels)-sum(train_labels):.0f} ({(1-sum(train_labels)/len(train_labels))*100:.1f}%)")

In [None]:
# =============================================================================
# CELL 5: TOKENIZATION
# =============================================================================

print("🔄 Tokenizing text...")

# Fit the tokenizer on training text only, so no test-set vocabulary leaks into training
tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token="<OOV>")
tokenizer.fit_on_texts(list(train_s1) + list(train_s2))

# Convert to sequences
train_seq1 = tokenizer.texts_to_sequences(train_s1)
train_seq2 = tokenizer.texts_to_sequences(train_s2)
test_seq1 = tokenizer.texts_to_sequences(test_s1)
test_seq2 = tokenizer.texts_to_sequences(test_s2)

# Padding
X1_train = pad_sequences(train_seq1, maxlen=MAX_LEN, padding='post')
X2_train = pad_sequences(train_seq2, maxlen=MAX_LEN, padding='post')
X1_test = pad_sequences(test_seq1, maxlen=MAX_LEN, padding='post')
X2_test = pad_sequences(test_seq2, maxlen=MAX_LEN, padding='post')

y_train = train_labels
y_test = test_labels

print("✅ Tokenization complete!")
print(f"   Vocabulary size: {min(len(tokenizer.word_index)+1, MAX_VOCAB)}")
print(f"   X1_train shape: {X1_train.shape}")
print(f"   X2_train shape: {X2_train.shape}")

In [None]:
# =============================================================================
# CELL 6: SIAMESE NETWORK ARCHITECTURE
# =============================================================================

# --- EUCLIDEAN DISTANCE ---
def euclidean_distance(vectors):
    """
    Compute the Euclidean distance between two embedding vectors.
    d(A, B) = sqrt(sum((A - B)^2))
    """
    (featsA, featsB) = vectors
    sum_squared = K.sum(K.square(featsA - featsB), axis=1, keepdims=True)
    return K.sqrt(K.maximum(sum_squared, K.epsilon()))

# --- CONTRASTIVE LOSS ---
def contrastive_loss(y_true, y_pred):
    """
    Contrastive loss function for metric learning.

    Formula:
    L = y * d^2 + (1 - y) * max(margin - d, 0)^2

    Where:
    - y = 1 (similar): Loss = d^2 (pulls the distance toward 0)
    - y = 0 (different): Loss = max(margin - d, 0)^2 (pushes the distance beyond the margin)
    """
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(MARGIN - y_pred, 0))
    return K.mean(y_true * square_pred + (1 - y_true) * margin_square)

print("📐 Loss Function: Contrastive Loss")
print(f"   Margin: {MARGIN}")
print("   - Similar pairs (y=1): minimize distance")
print("   - Different pairs (y=0): push distance > margin")

In [None]:
# =============================================================================
# CELL 7: BUILD SIAMESE MODEL
# =============================================================================

# --- BASE NETWORK (Shared Weights) ---
def create_base_network():
    """
    Base network shared by both inputs.
    Architecture: Embedding -> BiLSTM -> Dropout -> Dense -> L2 normalization
    """
    input_seq = Input(shape=(MAX_LEN,), name='input_sequence')

    # Embedding layer
    x = layers.Embedding(MAX_VOCAB, EMBEDDING_DIM, name='embedding')(input_seq)

    # Bidirectional LSTM to capture context from both directions
    x = layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=False), name='bilstm')(x)

    # Dropout for regularization
    x = layers.Dropout(0.3)(x)

    # Dense layer that produces the output embedding
    x = layers.Dense(DENSE_UNITS, activation='relu', name='dense')(x)

    # L2-normalize the embedding (important for metric learning)
    x = layers.Lambda(lambda t: K.l2_normalize(t, axis=1), name='l2_norm')(x)
    
    return Model(input_seq, x, name='base_network')

# Create base network
base_network = create_base_network()
print("🔧 BASE NETWORK ARCHITECTURE:")
base_network.summary()

# --- SIAMESE MODEL ---
input_a = Input(shape=(MAX_LEN,), name='input_sentence_1')
input_b = Input(shape=(MAX_LEN,), name='input_sentence_2')

# Shared weights: both inputs pass through the same network
embedding_a = base_network(input_a)
embedding_b = base_network(input_b)

# Compute distance
distance = layers.Lambda(euclidean_distance, name='euclidean_distance')([embedding_a, embedding_b])

# Final model
siamese_model = Model(inputs=[input_a, input_b], outputs=distance, name='siamese_network')
siamese_model.compile(loss=contrastive_loss, optimizer='adam')

print("\n🔧 SIAMESE NETWORK ARCHITECTURE:")
siamese_model.summary()

In [None]:
# =============================================================================
# CELL 8: TRAINING
# =============================================================================

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Callbacks
early_stop = EarlyStopping(
    monitor='val_loss', 
    patience=3, 
    verbose=1,
    restore_best_weights=True
)

reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    verbose=1,
    min_lr=1e-6
)

checkpoint = ModelCheckpoint(
    'jamal_metric_learning_best.h5',
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)

print("🚀 STARTING SIAMESE NETWORK TRAINING...")
print(f"   Epochs: {EPOCHS}")
print(f"   Batch Size: {BATCH_SIZE}")
print()

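# NOTE: the test split doubles as validation data here, so early stopping and
# checkpointing see the test set; metrics reported on it later are optimistic.
history = siamese_model.fit(
    [X1_train, X2_train], y_train,
    validation_data=([X1_test, X2_test], y_test),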
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks=[early_stop, reduce_lr, checkpoint],
    verbose=1
)

print("\n✅ Training Complete!")

In [None]:
# =============================================================================
# CELL 9: TRAINING VISUALIZATION
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(history.history['loss'], label='Training Loss', linewidth=2)
axes[0].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_title('Contrastive Loss During Training', fontsize=12)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Final metrics
final_train_loss = history.history['loss'][-1]
final_val_loss = history.history['val_loss'][-1]
best_val_loss = min(history.history['val_loss'])

metrics_text = f"""Training Summary:

Final Train Loss: {final_train_loss:.4f}
Final Val Loss: {final_val_loss:.4f}
Best Val Loss: {best_val_loss:.4f}

Total Epochs: {len(history.history['loss'])}
"""

axes[1].text(0.5, 0.5, metrics_text, fontsize=14, 
             ha='center', va='center', family='monospace',
             bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))
axes[1].axis('off')
axes[1].set_title('Training Summary', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# =============================================================================
# CELL 10: EVALUATION METRICS (ACADEMIC)
# =============================================================================

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report
)
from scipy.stats import pearsonr, spearmanr

# Predict distances
print("🔄 Generating predictions on test set...")
pred_distances = siamese_model.predict([X1_test, X2_test], verbose=0).ravel()

# Find the threshold that maximizes F1.
# NOTE: the sweep runs on the test set itself (no separate validation split
# is used), so the metrics below are optimistic.
thresholds = np.arange(0.1, 1.5, 0.05)
best_threshold = 0.5
best_f1 = 0

for thresh in thresholds:
    y_pred_temp = (pred_distances < thresh).astype(int)
    f1 = f1_score(y_test, y_pred_temp)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = thresh

print(f"\n🎯 Optimal Threshold (for classification): {best_threshold:.3f}")

# Apply optimal threshold
y_pred = (pred_distances < best_threshold).astype(int)

# Calculate metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# ROC-AUC (negate the distance because a small distance means similar)
roc_auc = roc_auc_score(y_test, -pred_distances)

# Correlation with the original scores
pearson_corr, _ = pearsonr(test_scores, -pred_distances)
spearman_corr, _ = spearmanr(test_scores, -pred_distances)

print("\n" + "="*50)
print("📊 EVALUATION METRICS (TEST SET)")
print("="*50)
print("\n🎯 Classification Metrics:")
print(f"   Accuracy:  {acc:.4f}")
print(f"   Precision: {prec:.4f}")
print(f"   Recall:    {rec:.4f}")
print(f"   F1-Score:  {f1:.4f}")
print(f"   ROC-AUC:   {roc_auc:.4f}")

print("\n📈 Correlation with Ground Truth Scores:")
print(f"   Pearson:  {pearson_corr:.4f}")
print(f"   Spearman: {spearman_corr:.4f}")

print("\n" + "="*50)
print("📋 Classification Report:")
print("="*50)
print(classification_report(y_test, y_pred, target_names=['Different (0)', 'Similar (1)']))

In [None]:
# =============================================================================
# CELL 11: EVALUATION VISUALIZATION
# =============================================================================

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0],
            xticklabels=['Pred Different', 'Pred Similar'],
            yticklabels=['Actual Different', 'Actual Similar'])
axes[0, 0].set_title('Confusion Matrix', fontsize=12)
axes[0, 0].set_ylabel('Actual Label')
axes[0, 0].set_xlabel('Predicted Label')

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_test, -pred_distances)
axes[0, 1].plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
axes[0, 1].plot([0, 1], [0, 1], 'k--', linewidth=1)
axes[0, 1].set_title('ROC Curve', fontsize=12)
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Distance Distribution
axes[1, 0].hist(pred_distances[y_test == 1], bins=30, alpha=0.7, label='Similar pairs', color='green')
axes[1, 0].hist(pred_distances[y_test == 0], bins=30, alpha=0.7, label='Different pairs', color='red')
axes[1, 0].axvline(x=best_threshold, color='black', linestyle='--', linewidth=2, label=f'Threshold={best_threshold:.2f}')
axes[1, 0].set_title('Distance Distribution by Label', fontsize=12)
axes[1, 0].set_xlabel('Euclidean Distance')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

# 4. Predicted Distance vs Ground Truth Score
axes[1, 1].scatter(test_scores, pred_distances, alpha=0.5, s=10)
z = np.polyfit(test_scores, pred_distances, 1)
p = np.poly1d(z)
axes[1, 1].plot(test_scores, p(test_scores), "r--", linewidth=2, label='Trend line')
axes[1, 1].set_title(f'Distance vs Ground Truth\n(Pearson: {pearson_corr:.3f}, Spearman: {spearman_corr:.3f})', fontsize=12)
axes[1, 1].set_xlabel('Ground Truth Score (0-5)')
axes[1, 1].set_ylabel('Predicted Distance')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# =============================================================================
# CELL 12: DEMO SIMILARITY CHECK
# =============================================================================

def check_similarity(text1, text2, threshold=None):
    """
    Measure the similarity between two texts using the trained Siamese Network.
    """
    if threshold is None:
        threshold = best_threshold
    
    # Tokenize and pad
    seq1 = tokenizer.texts_to_sequences([text1])
    seq2 = tokenizer.texts_to_sequences([text2])
    pad1 = pad_sequences(seq1, maxlen=MAX_LEN, padding='post')
    pad2 = pad_sequences(seq2, maxlen=MAX_LEN, padding='post')
    
    # Predict distance
    distance = siamese_model.predict([pad1, pad2], verbose=0)[0][0]
    
    # Verdict
    verdict = "✅ SIMILAR" if distance < threshold else "❌ DIFFERENT"
    # Rough heuristic, not a calibrated probability: maps distance 0 to 100% and distance 2 to 0%
    confidence = max(0, min(100, (1 - distance/2) * 100))
    
    print(f"A: {text1}")
    print(f"B: {text2}")
    print(f"Distance: {distance:.4f} (Threshold: {threshold:.2f})")
    print(f"Verdict: {verdict} (Confidence: {confidence:.1f}%)")
    print()
    
    return distance

print("=" * 60)
print("🧠 DEMO: SIMILARITY METRIC LEARNING")
print("=" * 60 + "\n")

# Test Cases - IT & Brainstorming Context
print("--- CASE 1: Paraphrase (should be SIMILAR) ---")
check_similarity("Login button is not working", "Cannot click the login button")

print("--- CASE 2: Same Topic (should be SIMILAR) ---")
check_similarity("Server returned 500 error", "Internal server error occurred")

print("--- CASE 3: Different Topics (should be DIFFERENT) ---")
check_similarity("Fix the login bug", "Order pizza for lunch")

print("--- CASE 4: Similar Structure, Different Meaning (TRICKY) ---")
check_similarity("How to learn Python?", "How to learn Java?")

In [None]:
# =============================================================================
# CELL 13: APPLYING METRIC LEARNING TO GROUP IDEAS (IMPROVED)
# =============================================================================

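def group_ideas_with_metric_learning(ideas, similarity_threshold=None, show_analysis=True):
    """
    Group ideas by similarity using the trained Siamese Network.

    Algorithm:
    1. Compute the pairwise distance for every pair of ideas
    2. Analyze the distance distribution to choose a threshold
    3. Build a similarity graph (edge if distance < threshold)
    4. Find connected components and treat them as groups

    This is NOT a separate clustering algorithm - it reuses the learned metric!
    """
    n = len(ideas)
    if n < 2:
        # No pairs to compare: return the usual result structure with no groups
        ungrouped = [{"index": i, "text": text} for i, text in enumerate(ideas)]
        return {
            "groups": {"Ungrouped": ungrouped} if ungrouped else {},
            "distance_matrix": np.zeros((n, n)),
            "threshold_used": similarity_threshold,
            "n_groups": 0,
            "all_distances": np.array([])
        }

    # Tokenize all ideas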
    seqs = tokenizer.texts_to_sequences(ideas)
    padded = pad_sequences(seqs, maxlen=MAX_LEN, padding='post')
    
    # Compute pairwise distances in a single batched prediction
    print(f"🔄 Computing pairwise distances for {n} ideas...")
    distance_matrix = np.zeros((n, n))
    idx_i, idx_j = np.triu_indices(n, k=1)  # all pairs with i < j
    all_distances = siamese_model.predict(
        [padded[idx_i], padded[idx_j]], verbose=0
    ).ravel()
    distance_matrix[idx_i, idx_j] = all_distances
    distance_matrix[idx_j, idx_i] = all_distances
    
    # --- DISTANCE DISTRIBUTION ANALYSIS ---
    if show_analysis:
        print("\n📊 DISTANCE ANALYSIS:")
        print(f"   Min distance: {all_distances.min():.4f}")
        print(f"   Max distance: {all_distances.max():.4f}")
        print(f"   Mean distance: {all_distances.mean():.4f}")
        print(f"   Median distance: {np.median(all_distances):.4f}")
        print(f"   Std deviation: {all_distances.std():.4f}")
    
    # --- ADAPTIVE THRESHOLD FOR GROUPING ---
    # IMPORTANT: the grouping threshold should be stricter than the classification one,
    # because only ideas that are REALLY similar should end up in the same group

    if similarity_threshold is None:
        # Use stricter percentiles
        percentile_25 = np.percentile(all_distances, 25)
        percentile_10 = np.percentile(all_distances, 10)
        percentile_5 = np.percentile(all_distances, 5)

        # Pick a very strict threshold for grouping:
        # only the closest 5-25% of pairs count as the same group
        if all_distances.mean() > 1.0:
            # High distances overall - the model is less discriminative
            similarity_threshold = percentile_5
        elif all_distances.mean() > 0.7:
            similarity_threshold = percentile_10
        else:
            similarity_threshold = percentile_25

        # HARD CAP - never exceed 0.4 for grouping
        similarity_threshold = min(similarity_threshold, 0.4)

        if show_analysis:
            print("\n🎯 THRESHOLD SELECTION (GROUPING):")
            print(f"   5th percentile:  {percentile_5:.4f}")
            print(f"   10th percentile: {percentile_10:.4f}")
            print(f"   25th percentile: {percentile_25:.4f}")
            print(f"   ➡️  Selected threshold: {similarity_threshold:.4f}")
            print("   (Capped at 0.4 to keep grouping strict)")
    else:
        if show_analysis:
            print(f"\n🎯 Using manual threshold: {similarity_threshold:.4f}")
    # Build adjacency based on threshold
    adjacency = distance_matrix < similarity_threshold
    np.fill_diagonal(adjacency, False)  # Do not connect a node to itself

    # Find connected components with a depth-first search
    visited = [False] * n
    groups = []
    
    def dfs(node, group):
        visited[node] = True
        group.append(node)
        for neighbor in range(n):
            if not visited[neighbor] and adjacency[node, neighbor]:
                dfs(neighbor, group)
    
    for i in range(n):
        if not visited[i]:
            group = []
            dfs(i, group)
            groups.append(group)
    
    # Format output - separate grouped and ungrouped ideas
    result = {
        "groups": {},
        "distance_matrix": distance_matrix,
        "threshold_used": similarity_threshold,
        "n_groups": 0,
        "all_distances": all_distances
    }
    
    group_counter = 0
    ungrouped_items = []
    
    for group in groups:
        if len(group) > 1:  # Only groups with more than one member
            group_counter += 1
            group_name = f"Group_{group_counter}"
            result["groups"][group_name] = []
            for i in group:
                result["groups"][group_name].append({"index": i, "text": ideas[i]})
        else:
            ungrouped_items.extend(group)
    
    result["n_groups"] = group_counter
    
    # Add ungrouped items
    if ungrouped_items:
        result["groups"]["Ungrouped"] = []
        for i in ungrouped_items:
            result["groups"]["Ungrouped"].append({"index": i, "text": ideas[i]})
    
    return result

# Demo
print("=" * 60)
print("🎯 APPLICATION: GROUPING IDEAS WITH METRIC LEARNING")
print("=" * 60 + "\n")

test_ideas = [
    # Group 1: Login/Auth issues (very similar)
    "Login button is broken",
    "Cannot access my account",
    "Password reset not working",

    # Group 2: Server issues (very similar)
    "Server returned 500 error",
    "API is not responding",

    # Group 3: UI changes (possibly similar)
    "Change the button color",
    "Make the logo bigger",

    # Outlier (should not join any group)
    "Order lunch for the team"
]

result = group_ideas_with_metric_learning(test_ideas)

print("\n" + "=" * 40)
print(f"📊 GROUPING RESULT: {result['n_groups']} groups found")
print(f"   (Threshold used: {result['threshold_used']:.4f})")
print("=" * 40 + "\n")

for group_name, items in result["groups"].items():
    emoji = "🔹" if group_name != "Ungrouped" else "⚪"
    print(f"{emoji} {group_name}:")
    for item in items:
        print(f"   [{item['index']}] {item['text']}")
    print()

In [None]:
# =============================================================================
# CELL 14: DISTANCE MATRIX VISUALIZATION & THRESHOLD ANALYSIS
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Distance Matrix Heatmap
ax1 = axes[0]
sns.heatmap(
    result['distance_matrix'],
    xticklabels=[f"{i}" for i in range(len(test_ideas))],
    yticklabels=[idea[:20] + "..." if len(idea) > 20 else idea for idea in test_ideas],
    cmap='RdYlGn_r',  # Reversed: green = small distance (similar)
    annot=True,
    fmt='.2f',
    vmin=0,
    ax=ax1
)
ax1.set_title(f'Distance Matrix\n(Threshold: {result["threshold_used"]:.3f})', fontsize=12)
ax1.set_xlabel('Idea Index')

# 2. Distance Distribution Histogram
ax2 = axes[1]
ax2.hist(result['all_distances'], bins=15, color='steelblue', edgecolor='black', alpha=0.7)
ax2.axvline(x=result['threshold_used'], color='red', linestyle='--', linewidth=2, 
            label=f'Grouping Threshold = {result["threshold_used"]:.3f}')
ax2.axvline(x=best_threshold, color='orange', linestyle='--', linewidth=2,
            label=f'Classification Threshold = {best_threshold:.3f}')
ax2.set_title('Distance Distribution (Pairwise)', fontsize=12)
ax2.set_xlabel('Euclidean Distance')
ax2.set_ylabel('Frequency')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 NOTES:")
print("   - GREEN in the matrix = SMALL distance (similar ideas)")
print("   - RED in the matrix = LARGE distance (different ideas)")
print(f"   - Pairs with distance < {result['threshold_used']:.3f} are grouped together")

In [None]:
# =============================================================================
# CELL 15: DOMAIN-SPECIFIC TESTING (BRAINSTORMING)
# =============================================================================

print("=" * 60)
print("📋 OUT-OF-DOMAIN TEST: JAMAL APP CASE STUDY")
print("=" * 60 + "\n")

# Brainstorming-specific test dataset
jamal_test_data = [
    # --- CATEGORY 1: IT & BUG REPORT ---
    ("Login button is broken", "Cannot click login", 1),
    ("Server returned 500 error", "Internal server error", 1),
    ("API response time is slow", "Latency is too high", 1),

    # --- CATEGORY 2: UI/UX DESIGN ---
    ("Change color to blue", "Update background color", 1),
    ("Make logo bigger", "Increase logo size", 1),

    # --- CATEGORY 3: TRAPS (similar topic, different meaning) ---
    ("Fix login bug", "Design login page", 0),
    ("Server is down", "Server is running fast", 0),

    # --- CATEGORY 4: VERY DIFFERENT ---
    ("Server is down", "I want to eat pizza", 0),
    ("Fix CSS style", "Meeting at 9 AM", 0),

    # --- CATEGORY 5: HARD PARAPHRASE ---
    ("I cannot remember my password", "Forgot password feature needed", 1),
]

# Evaluate
correct = 0
results_list = []

for text1, text2, expected in jamal_test_data:
    seq1 = tokenizer.texts_to_sequences([text1])
    seq2 = tokenizer.texts_to_sequences([text2])
    pad1 = pad_sequences(seq1, maxlen=MAX_LEN, padding='post')
    pad2 = pad_sequences(seq2, maxlen=MAX_LEN, padding='post')
    
    dist = siamese_model.predict([pad1, pad2], verbose=0)[0][0]
    pred = 1 if dist < best_threshold else 0
    
    status = "✅" if pred == expected else "❌"
    correct += 1 if pred == expected else 0
    
    results_list.append({
        "text1": text1,
        "text2": text2,
        "expected": expected,
        "predicted": pred,
        "distance": dist,
        "correct": pred == expected
    })
    
    expected_label = "SIMILAR" if expected == 1 else "DIFFERENT"
    pred_label = "SIMILAR" if pred == 1 else "DIFFERENT"
    
    print(f"{status} Expected: {expected_label}, Predicted: {pred_label} (d={dist:.3f})")
    print(f"   A: {text1}")
    print(f"   B: {text2}\n")

accuracy = correct / len(jamal_test_data) * 100
print(f"\n📊 ACCURACY ON THE BRAINSTORMING DOMAIN: {accuracy:.1f}% ({correct}/{len(jamal_test_data)})")

In [None]:
# =============================================================================
# CELL 16: MANUAL THRESHOLD TUNING (EXPERIMENT)
# =============================================================================

print("=" * 60)
print("🔧 EXPERIMENT: TUNING THE GROUPING THRESHOLD")
print("=" * 60 + "\n")

# Try several different thresholds
test_thresholds = [0.2, 0.3, 0.4, 0.5]

for thresh in test_thresholds:
    print(f"\n--- THRESHOLD: {thresh} ---")
    result_exp = group_ideas_with_metric_learning(test_ideas, similarity_threshold=thresh, show_analysis=False)
    
    print(f"Groups found: {result_exp['n_groups']}")
    for group_name, items in result_exp["groups"].items():
        if group_name != "Ungrouped":
            item_indices = [item['index'] for item in items]
            print(f"   {group_name}: indices {item_indices}")
    
    ungrouped = result_exp["groups"].get("Ungrouped", [])
    if ungrouped:
        print(f"   Ungrouped: {len(ungrouped)} items")

---

## 📝 CONCLUSION

### Algorithms Used

| Component | Detail |
|-----------|--------|
| **Architecture** | Siamese Neural Network |
| **Base Network** | Embedding + BiLSTM + Dense + L2 normalization |
| **Loss Function** | Contrastive Loss |
| **Metric** | Euclidean Distance |
| **Dataset** | STS Benchmark (Semantic Textual Similarity) |

### Grouping vs. Classification Thresholds

| Task | Threshold | Notes |
|------|-----------|-------|
| Classification | `best_threshold` (~0.5-0.8) | Optimized for F1-Score |
| Grouping | Much lower (~0.2-0.4) | Stricter: only ideas that are REALLY similar |
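
To make the "optimized for F1-Score" row concrete, here is a NumPy-only sketch of a threshold sweep. The helper names `f1_at_threshold` and `best_f1_threshold` and the toy distances and labels are illustrative assumptions, not values produced by this notebook.

```python
import numpy as np

def f1_at_threshold(distances, labels, threshold):
    """F1 score when pairs with distance < threshold are predicted similar."""
    pred = distances < threshold
    truth = labels == 1
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

def best_f1_threshold(distances, labels, candidates):
    """Return (threshold, f1) for the first candidate with the highest F1."""
    scores = [f1_at_threshold(distances, labels, t) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), scores[best]

# Toy values (illustrative only): one similar pair sits far away at 0.88
toy_distances = np.array([0.12, 0.22, 0.37, 0.62, 0.88, 1.23])
toy_labels = np.array([1, 1, 1, 0, 1, 0])
candidates = np.arange(0.1, 1.5, 0.05)

threshold, f1 = best_f1_threshold(toy_distances, toy_labels, candidates)
strict_f1 = f1_at_threshold(toy_distances, toy_labels, 0.4)
print(threshold, f1, strict_f1)
```

In this toy data the F1-optimal threshold is loose enough to admit one different pair, while the stricter 0.4 cutoff admits none at the cost of recall. That trade-off is the reason a lower threshold may suit grouping better.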

### Experiment Results

- The model learned a semantic representation of text
- Contrastive Loss is effective at pulling similar pairs together and pushing different pairs apart
- **IMPORTANT**: The grouping threshold MUST be stricter than the classification threshold
- The learned metric can be applied to idea grouping without changing the algorithm

### Application to JAMAL

1. **Input**: List of sticky notes from the canvas
2. **Process**: Compute pairwise distances with the trained Siamese Network
3. **Threshold Selection**: Use a percentile-based adaptive threshold
4. **Output**: Groups based on the similarity threshold

**Important Note**: Grouping here does NOT use a separate clustering algorithm. It uses the output of **Similarity Metric Learning** to build a similarity graph and find its connected components.
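
The graph step can be sketched independently of the model. The snippet below is a minimal illustration that assumes a precomputed symmetric distance matrix. The function name `connected_components` and the toy matrix are hypothetical, and it uses an iterative traversal instead of the recursive DFS in Cell 13.

```python
import numpy as np

def connected_components(distance_matrix, threshold):
    """Group indices whose pairwise distance is below threshold, transitively."""
    n = len(distance_matrix)
    adjacency = distance_matrix < threshold
    np.fill_diagonal(adjacency, False)  # no self-loops

    visited = [False] * n
    groups = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            # Visit every unvisited neighbor connected by an edge
            for neighbor in np.flatnonzero(adjacency[node]):
                if not visited[neighbor]:
                    visited[neighbor] = True
                    stack.append(int(neighbor))
        groups.append(sorted(group))
    return groups

# Toy symmetric distance matrix (illustrative values only)
toy = np.array([
    [0.0, 0.2, 0.9, 1.1],
    [0.2, 0.0, 0.3, 1.0],
    [0.9, 0.3, 0.0, 1.2],
    [1.1, 1.0, 1.2, 0.0],
])
groups = connected_components(toy, threshold=0.4)
print(groups)
```

Items 0 and 2 are 0.9 apart, yet they land in the same group because both are close to item 1. This chaining effect is inherent to connected components and is one reason a strict grouping threshold helps.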