# ISCA-k v2 - Benchmark Notebook

This notebook is the **central evaluation point** for the ISCA-k method.

## Objective
Systematically test the effect of each modification to the method:
1. **Phase 0**: Diagnostics (types, missingness, structure)
2. **Phase 1**: Effect of MI (Mutual Information) on the weights
3. **Phase 2**: Effect of PDS (Partial Distance Strategy)
4. **Phase 3**: Effect of adaptive k
5. Later phases, added as further modifications are introduced

## Notebook Structure
1. **Configuration** - Imports and seed
2. **Missingness Functions** - MCAR, MAR, MNAR (documented)
3. **Metric Functions** - R2, Pearson, NRMSE, Accuracy (documented)
4. **Datasets** - Load the test data
5. **Imputation Methods** - ISCA-k and baselines
6. **Benchmark** - Benchmark functions
7. **Run Benchmark** - Systematic execution
8. **Results Analysis** - Summaries and comparison with baselines
9. **Notes and Next Steps**

---
## 1. Configuration

In [None]:
import numpy as np
import pandas as pd
import time
import warnings
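from typing import Dict, List, Tuple, Optional
from IPython.display import display  # implicit in Jupyter; explicit so the results cells also run as a script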
from scipy.stats import pearsonr
from sklearn.metrics import r2_score, accuracy_score
from sklearn.datasets import load_iris, load_diabetes, load_wine
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

# Seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("Configuration OK")

---
## 2. Missingness Injection Functions

### Missingness Patterns

| Pattern | Description | P(missing) |
|---------|-------------|------------|
| **MCAR** | Missing Completely At Random | Constant, independent of X |
| **MAR** | Missing At Random | Depends on OBSERVED variables |
| **MNAR** | Missing Not At Random | Depends on the missing value ITSELF |

### Limitations of the Implementation
- **MAR**: Uses a single driver column (real scenarios may have multiple dependencies)
- **MNAR**: Values above the median always get more missing values (in real scenarios the direction may be reversed); for categorical columns, the modal category plays that role
- Missing values in different columns are generated independently
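As a rough check on these limitations, the sketch below estimates the overall missing rate each pattern should produce. It assumes that exactly half of the rows fall above the median (ties can shift this) and ignores the row/column constraints. `expected_rate` is a hypothetical helper, not part of the notebook. Under these assumptions MCAR and MNAR average out to the nominal rate, while MAR comes out lower because the driver column is never masked.

```python
def expected_rate(pattern: str, missing_rate: float, n_cols: int,
                  frac_high: float = 0.5) -> float:
    """Approximate overall missing rate under the median-split rule.

    Assumption: a fraction `frac_high` of rows gets the higher probability
    (values above the median, or the modal category for categorical MNAR
    columns). The "keep at least one value" constraints are ignored.
    """
    if pattern == "MCAR":
        return missing_rate
    prob_high = min(missing_rate * 1.5, 0.95)
    prob_low = max(missing_rate * 0.5, 0.0)
    per_cell = frac_high * prob_high + (1 - frac_high) * prob_low
    if pattern == "MNAR":
        return per_cell
    if pattern == "MAR":
        # The driver column is never masked, so only n_cols - 1 columns count.
        return per_cell * (n_cols - 1) / n_cols
    raise ValueError(f"unknown pattern: {pattern}")


# Iris has 4 columns: a nominal 20% MAR rate yields roughly 15% overall.
for pattern in ("MCAR", "MAR", "MNAR"):
    print(pattern, round(expected_rate(pattern, 0.2, n_cols=4), 4))
```

This gap between nominal and realized rate is why `run_single_experiment` records `Actual_Rate` alongside the metrics.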

In [None]:
def introduce_mcar(data: pd.DataFrame, missing_rate: float = 0.2,
                   random_state: int = 42) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Introduce MCAR (Missing Completely At Random) missing values.
    
    Algorithm:
    1. Generate a uniform random mask with P(missing) = missing_rate
    2. Ensure that no row is left 100% empty
    3. Ensure that no column is left 100% empty
    
    Args:
        data: Original DataFrame (without missing values)
        missing_rate: Target missing rate (0 to 1)
        random_state: Seed for reproducibility
    
    Returns:
        (data_missing, missing_mask): DataFrame with missing values and boolean mask
    """
    np.random.seed(random_state)
    data_missing = data.copy()
    n_rows, n_cols = data.shape
    
    # Create the random mask
    mask = np.random.random((n_rows, n_cols)) < missing_rate
    
    # Constraint: at least 1 observed value per row
    for i in range(n_rows):
        if mask[i].all():
            keep_idx = np.random.randint(n_cols)
            mask[i, keep_idx] = False
    
    # Constraint: at least 1 observed value per column
    for j in range(n_cols):
        if mask[:, j].all():
            keep_idx = np.random.randint(n_rows)
            mask[keep_idx, j] = False
    
    # Apply the mask
    missing_mask = pd.DataFrame(mask, index=data.index, columns=data.columns)
    for col in data.columns:
        data_missing.loc[missing_mask[col], col] = np.nan
    
    return data_missing, missing_mask


def introduce_mar(data: pd.DataFrame, missing_rate: float = 0.2,
                  random_state: int = 42) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Introduce MAR (Missing At Random) missing values.
    
    Algorithm:
    1. Select the first numeric column as the driver
       (falls back to MCAR if there is no numeric column)
    2. If driver > median: P(missing) = rate * 1.5 (capped at 0.95)
    3. If driver <= median: P(missing) = rate * 0.5
    4. The driver never has missing values (so the dependency stays observable)
    """
    np.random.seed(random_state)
    data_missing = data.copy()
    n_rows, n_cols = data.shape
    
    # Find the driver column (first numeric column)
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) == 0:
        return introduce_mcar(data, missing_rate, random_state)
    
    driver_col = numeric_cols[0]
    driver_values = data[driver_col].values
    driver_median = np.nanmedian(driver_values)
    
    prob_high = min(missing_rate * 1.5, 0.95)
    prob_low = max(missing_rate * 0.5, 0.0)
    
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    
    for i in range(n_rows):
        prob = prob_high if driver_values[i] > driver_median else prob_low
        for j, col in enumerate(data.columns):
            if col == driver_col:
                continue
            if np.random.random() < prob:
                mask[i, j] = True
    
    # Constraints
    for i in range(n_rows):
        if mask[i].all():
            keep_idx = np.random.randint(n_cols)
            mask[i, keep_idx] = False
    
    for j in range(n_cols):
        if mask[:, j].all():
            keep_idx = np.random.randint(n_rows)
            mask[keep_idx, j] = False
    
    missing_mask = pd.DataFrame(mask, index=data.index, columns=data.columns)
    for col in data.columns:
        data_missing.loc[missing_mask[col], col] = np.nan
    
    return data_missing, missing_mask


def introduce_mnar(data: pd.DataFrame, missing_rate: float = 0.2,
                   random_state: int = 42) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Introduce MNAR (Missing Not At Random) missing values.
    
    Algorithm:
    1. For each column, independently:
    2. If value > median: P(missing) = rate * 1.5 (capped at 0.95)
    3. If value <= median: P(missing) = rate * 0.5
    
    For non-numeric columns, the modal category takes the role of
    "value > median".
    """
    np.random.seed(random_state)
    data_missing = data.copy()
    n_rows, n_cols = data.shape
    
    prob_high = min(missing_rate * 1.5, 0.95)
    prob_low = max(missing_rate * 0.5, 0.0)
    
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    
    for j, col in enumerate(data.columns):
        col_data = data[col]
        
        if pd.api.types.is_numeric_dtype(col_data):
            col_median = col_data.median()
            for i in range(n_rows):
                prob = prob_high if col_data.iloc[i] > col_median else prob_low
                if np.random.random() < prob:
                    mask[i, j] = True
        else:
            mode_val = col_data.mode().iloc[0] if len(col_data.mode()) > 0 else None
            for i in range(n_rows):
                prob = prob_high if col_data.iloc[i] == mode_val else prob_low
                if np.random.random() < prob:
                    mask[i, j] = True
    
    # Constraints
    for i in range(n_rows):
        if mask[i].all():
            keep_idx = np.random.randint(n_cols)
            mask[i, keep_idx] = False
    
    for j in range(n_cols):
        if mask[:, j].all():
            keep_idx = np.random.randint(n_rows)
            mask[keep_idx, j] = False
    
    missing_mask = pd.DataFrame(mask, index=data.index, columns=data.columns)
    for col in data.columns:
        data_missing.loc[missing_mask[col], col] = np.nan
    
    return data_missing, missing_mask


MISSINGNESS_FUNCTIONS = {
    'MCAR': introduce_mcar,
    'MAR': introduce_mar,
    'MNAR': introduce_mnar
}

print("Missingness functions OK")

---
## 3. Metric Functions

### Metrics for Numeric Variables

| Metric | Formula | Range | Interpretation |
|--------|---------|-------|----------------|
| **R2** | 1 - SS_res/SS_tot | (-inf, 1] | 1=perfect, 0=equivalent to predicting the mean, <0=worse than the mean |
| **Pearson** | cov(y,y')/(s_y*s_y') | [-1, 1] | 1=perfect positive correlation |
| **NRMSE** | RMSE/(max-min) | [0, inf) | 0=perfect, normalized by the range of the true values |

### Metrics for Categorical Variables

| Metric | Formula | Range | Interpretation |
|--------|---------|-------|----------------|
| **Accuracy** | n_correct/n_total | [0, 1] | 1=perfect |

### IMPORTANT
Metrics are computed **only on the values that were imputed** (selected via `missing_mask`).
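To make the masking concrete, here is a minimal sketch on a toy column. The arrays and the helper name `masked_scores` are illustrative, not notebook code. R2 and NRMSE are computed by hand from the formulas in the table above, using only the positions flagged in the mask.

```python
import numpy as np

def masked_scores(true_col, imputed_col, mask):
    """Return (R2, NRMSE) computed only where mask is True.

    Assumes the masked true values are not all equal, so that both
    SS_tot and the range are non-zero.
    """
    y = np.asarray(true_col, dtype=float)[mask]
    y_hat = np.asarray(imputed_col, dtype=float)[mask]
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    return 1.0 - ss_res / ss_tot, rmse / (y.max() - y.min())


true_col = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
imputed_col = np.array([1.0, 2.5, 3.0, 3.0, 5.0])    # observed cells are unchanged
mask = np.array([False, True, False, True, False])   # cells that were masked

r2, nrmse = masked_scores(true_col, imputed_col, mask)
print(f"R2={r2:.4f}, NRMSE={nrmse:.4f}")  # prints R2=0.3750, NRMSE=0.3953
```

Scoring all five cells instead would reward the three untouched values and overstate imputation quality.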

In [None]:
def calculate_nrmse(true_values: np.ndarray, imputed_values: np.ndarray) -> float:
    """Compute NRMSE (Root Mean Squared Error normalized by the range of the true values)."""
    valid = ~np.isnan(true_values) & ~np.isnan(imputed_values)
    if valid.sum() == 0:
        return np.nan
    
    true_subset = true_values[valid]
    imputed_subset = imputed_values[valid]
    
    rmse = np.sqrt(np.mean((true_subset - imputed_subset) ** 2))
    value_range = true_subset.max() - true_subset.min()
    
    if value_range == 0:
        return np.nan
    
    return rmse / value_range


def calculate_metrics_per_column(original_data: pd.DataFrame, 
                                  imputed_data: pd.DataFrame,
                                  missing_mask: pd.DataFrame,
                                  col_types: Dict[str, str]) -> Dict[str, float]:
    """
    Compute metrics per column, evaluating ONLY the imputed values.
    
    IMPORTANT: Uses missing_mask to keep only the values that were
    introduced as missing and then imputed. Per-column scores are
    averaged (unweighted) across columns.
    """
    r2_scores = []
    pearson_scores = []
    nrmse_scores = []
    accuracy_scores = []
    
    for col in original_data.columns:
        col_mask = missing_mask[col].values
        if col_mask.sum() == 0:
            continue
        
        true_values = original_data.loc[col_mask, col].values
        imputed_values = imputed_data.loc[col_mask, col].values
        
        is_categorical = col_types.get(col, 'numeric') == 'categorical'
        
        if is_categorical:
            valid = np.array([
                v is not None and str(v).lower() != 'nan' and pd.notna(v)
                for v in imputed_values
            ])
            
            if valid.sum() < 1:
                continue
            
            true_str = [str(x) for x in true_values[valid]]
            imputed_str = [str(x) for x in imputed_values[valid]]
            
            try:
                acc = accuracy_score(true_str, imputed_str)
                accuracy_scores.append(acc)
            except Exception:
                pass
        else:
            try:
                true_float = true_values.astype(float)
                imputed_float = imputed_values.astype(float)
            except (ValueError, TypeError):
                continue
            
            valid = ~np.isnan(imputed_float) & ~np.isnan(true_float)
            if valid.sum() < 2:
                continue
            
            true_subset = true_float[valid]
            imputed_subset = imputed_float[valid]
            
            if np.std(true_subset) > 0:
                try:
                    r2 = r2_score(true_subset, imputed_subset)
                    r2_scores.append(r2)
                except Exception:
                    pass
            
            if np.std(true_subset) > 0 and np.std(imputed_subset) > 0:
                try:
                    corr, _ = pearsonr(true_subset, imputed_subset)
                    if np.isfinite(corr):
                        pearson_scores.append(corr)
                except Exception:
                    pass
            
            nrmse = calculate_nrmse(true_subset, imputed_subset)
            if np.isfinite(nrmse):
                nrmse_scores.append(nrmse)
    
    return {
        'R2': np.mean(r2_scores) if r2_scores else np.nan,
        'Pearson': np.mean(pearson_scores) if pearson_scores else np.nan,
        'NRMSE': np.mean(nrmse_scores) if nrmse_scores else np.nan,
        'Accuracy': np.mean(accuracy_scores) if accuracy_scores else np.nan
    }


print("Metric functions OK")

---
## 4. Datasets

In [None]:
def load_benchmark_datasets() -> Dict:
    datasets = {}
    
    iris = load_iris()
    datasets['iris'] = {
        'data': pd.DataFrame(iris.data, columns=iris.feature_names),
        'col_types': {col: 'numeric' for col in iris.feature_names},
        'description': 'Iris (150x4, numeric)'
    }
    
    wine = load_wine()
    datasets['wine'] = {
        'data': pd.DataFrame(wine.data, columns=wine.feature_names),
        'col_types': {col: 'numeric' for col in wine.feature_names},
        'description': 'Wine (178x13, numeric)'
    }
    
    diabetes = load_diabetes()
    datasets['diabetes'] = {
        'data': pd.DataFrame(diabetes.data, columns=diabetes.feature_names),
        'col_types': {col: 'numeric' for col in diabetes.feature_names},
        'description': 'Diabetes (442x10, numeric, ALREADY NORMALIZED)'
    }
    
    return datasets


DATASETS = load_benchmark_datasets()

print("Datasets loaded:")
for name, info in DATASETS.items():
    shape = info['data'].shape
    print(f"  {name}: {shape[0]} samples x {shape[1]} features")

---
## 5. Imputation Methods

In [None]:
def impute_knn(data_missing: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """Impute using sklearn's KNNImputer; categorical columns are label-encoded first."""
    data_encoded = data_missing.copy()
    encoders = {}
    
    for col in data_encoded.columns:
        if data_encoded[col].dtype == 'object' or data_encoded[col].dtype.name == 'category':
            le = LabelEncoder()
            non_null = data_encoded[col].dropna()
            if len(non_null) > 0:
                le.fit(non_null)
                encoders[col] = le
                mask = data_encoded[col].notna()
                data_encoded.loc[mask, col] = le.transform(data_encoded.loc[mask, col])
            data_encoded[col] = pd.to_numeric(data_encoded[col], errors='coerce')
    
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputed_array = imputer.fit_transform(data_encoded)
    
    result = pd.DataFrame(imputed_array, columns=data_missing.columns, index=data_missing.index)
    
    for col, le in encoders.items():
        result[col] = result[col].round().astype(int)
        result[col] = result[col].clip(0, len(le.classes_) - 1)
        result[col] = le.inverse_transform(result[col])
    
    return result


def impute_mice(data_missing: pd.DataFrame, max_iter: int = 10) -> pd.DataFrame:
    """Impute using MICE (sklearn's IterativeImputer); categorical columns are label-encoded first."""
    data_encoded = data_missing.copy()
    encoders = {}
    
    for col in data_encoded.columns:
        if data_encoded[col].dtype == 'object' or data_encoded[col].dtype.name == 'category':
            le = LabelEncoder()
            non_null = data_encoded[col].dropna()
            if len(non_null) > 0:
                le.fit(non_null)
                encoders[col] = le
                mask = data_encoded[col].notna()
                data_encoded.loc[mask, col] = le.transform(data_encoded.loc[mask, col])
            data_encoded[col] = pd.to_numeric(data_encoded[col], errors='coerce')
    
    imputer = IterativeImputer(max_iter=max_iter, random_state=42)
    imputed_array = imputer.fit_transform(data_encoded)
    
    result = pd.DataFrame(imputed_array, columns=data_missing.columns, index=data_missing.index)
    
    for col, le in encoders.items():
        result[col] = result[col].round().astype(int)
        result[col] = result[col].clip(0, len(le.classes_) - 1)
        result[col] = le.inverse_transform(result[col])
    
    return result


def impute_iscak(data_missing: pd.DataFrame) -> pd.DataFrame:
    """
    Impute using ISCA-k.
    
    CURRENT VERSION: plain KNN (baseline)
    
    TODO: Modify progressively to test:
    - [ ] MI weights
    - [ ] PDS
    - [ ] Adaptive k
    """
    return impute_knn(data_missing, n_neighbors=5)


IMPUTATION_METHODS = {
    'KNN': impute_knn,
    'MICE': impute_mice,
    'ISCA-k': impute_iscak
}

print(f"Imputation methods: {list(IMPUTATION_METHODS.keys())}")

---
## 6. Benchmark

In [None]:
def run_single_experiment(data_original, col_types, method_name, method_fn,
                          missing_rate, pattern, random_state):
    data_missing, missing_mask = MISSINGNESS_FUNCTIONS[pattern](
        data_original, missing_rate, random_state
    )
    
    start = time.time()
    try:
        data_imputed = method_fn(data_missing)
        elapsed = time.time() - start
    except Exception as e:
        return {'error': str(e)}
    
    metrics = calculate_metrics_per_column(
        data_original, data_imputed, missing_mask, col_types
    )
    
    actual_rate = missing_mask.sum().sum() / missing_mask.size
    
    return {**metrics, 'Time_s': elapsed, 'Actual_Rate': actual_rate}


def run_benchmark(datasets=None, methods=None,
                  missing_rates=(0.1, 0.2, 0.3, 0.4),
                  patterns=('MCAR', 'MAR'),
                  n_runs=3, verbose=True):
    if datasets is None:
        datasets = DATASETS
    if methods is None:
        methods = IMPUTATION_METHODS
    
    results = []
    total = len(datasets) * len(methods) * len(missing_rates) * len(patterns) * n_runs
    current = 0
    
    for ds_name, ds_info in datasets.items():
        data = ds_info['data']
        col_types = ds_info['col_types']
        
        for pattern in patterns:
            for rate in missing_rates:
                for method_name, method_fn in methods.items():
                    for run in range(n_runs):
                        current += 1
                        seed = RANDOM_SEED + run
                        
                        if verbose:
                            print(f"\r[{current}/{total}] {ds_name} | {pattern} | {rate:.0%} | {method_name}", end="")
                        
                        metrics = run_single_experiment(
                            data, col_types, method_name, method_fn,
                            rate, pattern, seed
                        )
                        
                        results.append({
                            'Dataset': ds_name,
                            'Pattern': pattern,
                            'Missing_Rate': f"{round(rate * 100)}%",  # round, not int: int(0.29 * 100) gives 28
                            'Method': method_name,
                            'Run': run,
                            **metrics
                        })
    
    if verbose:
        print("\nDone!")
    
    return pd.DataFrame(results)


print("Benchmark function OK")

---
## 7. Run Benchmark

In [None]:
MISSING_RATES = [0.1, 0.2, 0.3, 0.4]
PATTERNS = ['MCAR', 'MAR']
N_RUNS = 3

print("Configuration:")
print(f"  Missing rates: {MISSING_RATES}")
print(f"  Patterns: {PATTERNS}")
print(f"  Repetitions: {N_RUNS}")
print()

results_df = run_benchmark(
    missing_rates=MISSING_RATES,
    patterns=PATTERNS,
    n_runs=N_RUNS
)

results_df.to_csv('benchmark_results.csv', index=False)
print("\nResults saved to benchmark_results.csv")

---
## 8. Results Analysis

In [None]:
print("=" * 60)
print("SUMMARY BY METHOD")
print("=" * 60)

summary = results_df.groupby('Method').agg({
    'R2': ['mean', 'std'],
    'NRMSE': ['mean', 'std'],
    'Time_s': 'mean'
}).round(4)

display(summary)

In [None]:
print("\n" + "=" * 60)
print("SUMMARY BY DATASET")
print("=" * 60)

for ds in results_df['Dataset'].unique():
    print(f"\n--- {ds.upper()} ---")
    ds_data = results_df[results_df['Dataset'] == ds]
    summary_ds = ds_data.groupby('Method').agg({
        'R2': 'mean',
        'NRMSE': 'mean'
    }).round(4)
    display(summary_ds)

In [None]:
print("\n" + "=" * 60)
print("ISCA-k vs BASELINES")
print("=" * 60)

iscak_r2 = results_df[results_df['Method'] == 'ISCA-k']['R2'].mean()
knn_r2 = results_df[results_df['Method'] == 'KNN']['R2'].mean()
mice_r2 = results_df[results_df['Method'] == 'MICE']['R2'].mean()

print("\nMean R2:")
print(f"  ISCA-k: {iscak_r2:.4f}")
print(f"  KNN:    {knn_r2:.4f}")
print(f"  MICE:   {mice_r2:.4f}")

print(f"\nDifference ISCA-k vs KNN:  {iscak_r2 - knn_r2:+.4f}")
print(f"Difference ISCA-k vs MICE: {iscak_r2 - mice_r2:+.4f}")

---
## 9. Notes and Next Steps

### Current State
- ISCA-k currently delegates to KNN with k=5 (baseline), so its metrics are identical to KNN's
- Metrics are evaluated only on imputed values
- Missingness functions are documented

### Next Modifications to Test
1. **MI (Mutual Information)**: Modify `impute_iscak` to use MI weights
2. **PDS**: Add the Partial Distance Strategy
3. **Adaptive k**: Vary k based on density/consistency
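As a starting point for items 1 and 2, the sketch below shows one common form of partial distance: squared differences are accumulated only over features observed in both rows, then rescaled by the ratio of total to usable weight. The optional `weights` argument is where MI-derived feature weights could plug in. This is an assumption about how the two ideas might be combined, not the actual ISCA-k definition, and `partial_distance` is a hypothetical name.

```python
import numpy as np

def partial_distance(x, y, weights=None):
    """Weighted Euclidean distance over features observed in both rows.

    Returns np.nan when the rows share no observed feature
    (or when all shared features have zero weight).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)

    both = ~np.isnan(x) & ~np.isnan(y)   # jointly observed features
    usable = w[both].sum()
    if usable == 0:
        return np.nan

    sq = np.sum(w[both] * (x[both] - y[both]) ** 2)
    # Rescale so that rows sharing few features are not artificially close.
    return float(np.sqrt(sq * w.sum() / usable))


a = [1.0, np.nan, 3.0, 4.0]
b = [1.0, 2.0, np.nan, 0.0]
d_plain = partial_distance(a, b)                           # unit weights
d_weighted = partial_distance(a, b, weights=[1, 1, 1, 3])  # last feature counts more
print(d_plain, d_weighted)
```

With unit weights this reduces to the `nan_euclidean` metric that `KNNImputer` uses by default, so the KNN baseline already applies a partial distance; the open question for Phase 2 is what the weighting adds on top.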

### How to Use This Notebook
1. Modify `impute_iscak()` with the new functionality
2. Rerun section 5 (so `IMPUTATION_METHODS` picks up the new function), then sections 7 and 8
3. Analyze the differences in the results
4. Document the conclusions
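For step 3, one way to compare a modified `impute_iscak` against an earlier run is to diff the per-method means of two result tables. The sketch assumes both tables have the `Method`, `R2`, and `NRMSE` columns produced by `run_benchmark`; `compare_runs` and the toy data are hypothetical.

```python
import pandas as pd

def compare_runs(before: pd.DataFrame, after: pd.DataFrame,
                 metrics=("R2", "NRMSE")) -> pd.DataFrame:
    """Per-method change in mean metric values (after minus before)."""
    cols = list(metrics)
    mean_before = before.groupby("Method")[cols].mean()
    mean_after = after.groupby("Method")[cols].mean()
    return (mean_after - mean_before).add_suffix("_delta")


# Toy tables standing in for two saved copies of the benchmark results.
before = pd.DataFrame({"Method": ["KNN", "ISCA-k"],
                       "R2": [0.50, 0.50], "NRMSE": [0.20, 0.20]})
after = pd.DataFrame({"Method": ["KNN", "ISCA-k"],
                      "R2": [0.50, 0.56], "NRMSE": [0.20, 0.18]})

delta = compare_runs(before, after)
print(delta.round(4))
```

Because section 7 always writes to `benchmark_results.csv`, copy or rename the previous file before rerunning if you want to keep it for this kind of comparison.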