# Extractive Summarization Model Comparison

This notebook compares five machine learning models for extractive summarization. We treat summarization as a binary classification problem at the sentence level.

**Goal:** Predict whether a sentence belongs to the summary (`Label_Final = 1`) or not (`0`) based on extracted features.

**Models Compared:**
1. Support Vector Machine (SVM) - *Primary Interest*
2. Random Forest (RF)
3. Logistic Regression (LR)
4. K-Nearest Neighbors (KNN)
5. Naive Bayes (NB)

**Methodology:**
- **Baseline:** Default hyperparameters.
- **Optimization:** Hyperparameter tuning using Optuna.
- **Evaluation:** Stratified K-Fold Cross-Validation (k=5).
- **Metrics:** Accuracy, Precision, Recall, F1-Score, Cohen's Kappa.
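
The evaluation step listed above could look roughly like the sketch below. It is an assumption about what `evaluate_model` in `src.model_trainer` might do, not its actual code: `evaluate_model_sketch`, the synthetic data, and the choice to compute metrics on pooled out-of-fold predictions (rather than averaging per-fold scores) are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def evaluate_model_sketch(model, X, y, n_splits=5, seed=42):
    """Hypothetical stand-in for evaluate_model: stratified k-fold CV metrics."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Each sample is predicted by a model that did not see it during training.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    return {
        "Accuracy": accuracy_score(y, y_pred),
        "Precision": precision_score(y, y_pred, zero_division=0),
        "Recall": recall_score(y, y_pred, zero_division=0),
        "F1-Score": f1_score(y, y_pred, zero_division=0),
        "Kappa": cohen_kappa_score(y, y_pred),
    }


# Synthetic, imbalanced stand-in for the sentence-feature matrix.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)
demo_metrics = evaluate_model_sketch(
    LogisticRegression(max_iter=1000), X_demo, y_demo
)
print(demo_metrics)
```

Stratification keeps the share of summary sentences roughly equal in every fold, which matters when `Label_Final = 1` is the minority class.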

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.data_preprocessing import prepare_data
from src.model_trainer import get_models, evaluate_model
from src.optimizer import optimize_model

# Set plot style
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

## 1. Data Loading and Preprocessing

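We load the dataset, handle missing values, and scale the features. Scaling matters most for models that depend on distances or inner products between feature vectors, such as KNN and kernel SVM; tree-based models such as Random Forest are largely insensitive to it.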
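
One caveat: if the scaler is fitted on the full dataset before cross-validation, statistics from the held-out folds leak into training. Whether `prepare_data` does this is not visible here, so the following is only a sketch of one common alternative, using synthetic data: wrap the scaler and the model in a scikit-learn pipeline so that scaling is refitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is put on a much larger scale on purpose.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo[:, 0] *= 1000.0

# The scaler is part of the estimator, so it only ever sees training folds.
scaled_svm = make_pipeline(StandardScaler(), SVC())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_f1 = cross_val_score(scaled_svm, X_demo, y_demo, cv=cv, scoring="f1")
print(fold_f1.mean())
```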

In [None]:
# Load and prepare data
X, y = prepare_data('korpus_mentah_new.csv')

print("Feature shape:", X.shape)
print("Target distribution:\n", y.value_counts(normalize=True))
X.head()

## 2. Baseline Model Evaluation

We evaluate all five models with their default hyperparameters to establish a performance baseline.

In [None]:
baseline_models = get_models()
baseline_results = []

print("Running Baseline Evaluation...")
for name, model in baseline_models.items():
    print(f"Evaluating {name}...")
    metrics = evaluate_model(model, X, y)
    metrics['Model'] = name
    metrics['Type'] = 'Baseline'
    baseline_results.append(metrics)

df_baseline = pd.DataFrame(baseline_results)
df_baseline

## 3. Hyperparameter Optimization (Optuna)

We use Optuna to search for the best hyperparameters of each model, maximizing the **F1-Score**. F1 is the harmonic mean of Precision and Recall, so it rewards selecting relevant sentences (Precision) without missing too many summary sentences (Recall).

In [None]:
optimized_results = []
best_params_log = {}

print("Running Optuna Optimization (this may take a while)...")
for name in baseline_models.keys():
    print(f"Optimizing {name}...")
    # Run optimization (50 trials per model)
    best_model, best_params, best_score = optimize_model(name, X, y, n_trials=50)
    
    best_params_log[name] = best_params
    print(f"  Best F1: {best_score:.4f}")
    
    # Evaluate the optimized model using the same CV strategy.
    # Note: the model was tuned on this same X, y, so these scores are likely optimistic.
    metrics = evaluate_model(best_model, X, y)
    metrics['Model'] = name
    metrics['Type'] = 'Optimized'
    optimized_results.append(metrics)

df_optimized = pd.DataFrame(optimized_results)
df_optimized

### Best Hyperparameters Found

In [None]:
for name, params in best_params_log.items():
    print(f"--- {name} ---")
    print(params)
    print("")

## 4. Comparison & Analysis

We combine the results to compare the performance improvement.
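
One way to quantify the improvement per model is to pivot the combined table so that Baseline and Optimized sit side by side, then take the difference. The sketch below uses a small hypothetical frame with the same `Model`, `Type`, and `F1-Score` columns; the numbers are placeholders for illustration only, not results from this study.

```python
import pandas as pd

# Placeholder values, NOT measured results.
df_demo = pd.DataFrame({
    "Model": ["SVM", "SVM", "KNN", "KNN"],
    "Type": ["Baseline", "Optimized", "Baseline", "Optimized"],
    "F1-Score": [0.70, 0.75, 0.60, 0.68],
})

# One row per model, one column per configuration.
f1_wide = df_demo.pivot(index="Model", columns="Type", values="F1-Score")
f1_wide["Delta"] = f1_wide["Optimized"] - f1_wide["Baseline"]
print(f1_wide.sort_values("Delta", ascending=False))
```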

In [None]:
df_final = pd.concat([df_baseline, df_optimized], ignore_index=True)
df_final.sort_values(by=['Model', 'Type'], inplace=True)
df_final

### Visualization: F1-Score Comparison

In [None]:
plt.figure(figsize=(14, 7))
sns.barplot(data=df_final, x='Model', y='F1-Score', hue='Type', palette='viridis')
plt.title('Model Comparison: Baseline vs Optimized (F1-Score)', fontsize=16)
plt.ylim(0, 1.0)
plt.legend(title='Configuration', loc='lower right')
plt.grid(axis='y', alpha=0.3)
plt.show()

### Visualization: Cohen's Kappa Comparison
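Cohen's Kappa measures agreement between two sets of labels, corrected for the agreement expected by chance. Here the two "raters" are the model and the reference labels, so Kappa shows how far the predictions exceed chance-level agreement, which is especially informative when the classes are imbalanced.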
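
For reference, Kappa is computed as `(p_o - p_e) / (1 - p_e)`, where `p_o` is the observed agreement and `p_e` is the agreement expected by chance from the label frequencies. The helper below is a small pure-Python sketch for binary labels (the function name is ours; in practice `sklearn.metrics.cohen_kappa_score` does the same job). The toy labels show why accuracy alone can mislead: a model that never selects a sentence reaches 80% accuracy on this sample, but its Kappa is roughly 0.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length sequences of binary (0/1) labels."""
    n = len(y_true)
    # Observed agreement: fraction of positions where the two labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement, from the share of positives in each label set.
    true_pos_rate = sum(y_true) / n
    pred_pos_rate = sum(y_pred) / n
    p_e = (true_pos_rate * pred_pos_rate
           + (1 - true_pos_rate) * (1 - pred_pos_rate))
    if p_e == 1.0:
        return float("nan")  # undefined: both label sets are constant and identical
    return (p_o - p_e) / (1 - p_e)


y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_zero = [0] * 10                           # never selects a sentence
print(cohen_kappa(y_true, always_zero))          # chance-level agreement
print(cohen_kappa([1, 0, 1, 0], [0, 1, 0, 1]))   # systematic disagreement is negative
```

Because Kappa can be negative, a chart axis that starts at 0 would hide models that do worse than chance.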

In [None]:
plt.figure(figsize=(14, 7))
sns.barplot(data=df_final, x='Model', y='Kappa', hue='Type', palette='magma')
plt.title("Model Comparison: Baseline vs Optimized (Cohen's Kappa)", fontsize=16)
plt.ylim(min(0.0, df_final['Kappa'].min()), 1.0)  # Kappa can be negative; avoid clipping it
plt.legend(title='Configuration', loc='lower right')
plt.grid(axis='y', alpha=0.3)
plt.show()

## 5. Conclusion

The tables and charts above summarize the performance of SVM against Random Forest, Logistic Regression, KNN, and Naive Bayes.

**Key Observations:**
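1. **Impact of Optimization:** Compare each model's Optimized and Baseline rows to see whether Optuna improved F1-Score and Kappa. SVM and KNN are sensitive to feature scale and to hyperparameters such as `C` and the number of neighbors, so they often gain the most.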
2. **Best Model:** Identify which model achieved the highest F1-Score and Kappa.
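3. **Consistency:** Check how stable each model's scores are across the five folds. A model with a slightly lower mean but a smaller spread may be the safer choice.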
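
For the consistency check, one simple option is to look at the per-fold scores directly and report their mean and standard deviation. The sketch below uses synthetic data and Logistic Regression as an arbitrary example model; whether `evaluate_model` exposes per-fold scores is not known, so this computes them independently.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One F1 value per fold; the spread indicates how stable the model is.
fold_f1 = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="f1"
)
print(f"F1 = {fold_f1.mean():.3f} +/- {fold_f1.std():.3f} over {len(fold_f1)} folds")
```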

This analysis helps select the most robust model for the text summarization pipeline.