# 🚀 NASA Defect Prediction: MC2

**Dataset:** MC2
**Method:** Random Forest → KAN → Attention-KAN
**Goal:** Increase Recall while preserving Accuracy (F2-focused)

## Problem Definition
The NASA software defect datasets are used to predict whether a software component is defective. Because the class distribution is imbalanced, **Recall** is critical. However, increasing Recall alone can lower Accuracy. In this notebook we compare the Random Forest, KAN, and Attention-KAN models and observe the Recall/Accuracy trade-off.
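
To make the F2 focus concrete, here is a minimal sketch that computes F-beta directly from confusion-matrix counts. The helper `f_beta` and both sets of counts are hypothetical illustrations, not MC2 results; in the notebook itself the same quantity comes from scikit-learn's `fbeta_score`.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from confusion-matrix counts; beta > 1 weights recall more heavily."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two made-up classifiers scored on the same 20 defective modules.
cautious = dict(tp=10, fp=2, fn=10)  # high precision, low recall
eager = dict(tp=16, fp=12, fn=4)     # lower precision, high recall

for name, counts in [("cautious", cautious), ("eager", eager)]:
    f1 = f_beta(**counts, beta=1)
    f2 = f_beta(**counts, beta=2)
    print(f"{name}: F1={f1:.3f}, F2={f2:.3f}")
```

With these counts the two classifiers are close on F1, while F2 clearly favours the one that misses fewer defective modules.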


## 1) Data Preparation
- Mount Google Drive
- Read the ARFF file
- Label conversion and missing-value handling
- Train/Val/Test split (stratified)
- Min-Max scaling
- Oversampling with SMOTE (training set only)
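
The arithmetic behind the split and the oversampling ratio can be sketched as follows. The helpers `split_fractions` and `smote_synthetic_count` are hypothetical, the class counts are made up rather than taken from MC2, and the sketch assumes imbalanced-learn's documented meaning of a float `sampling_strategy` (the desired minority/majority ratio after resampling).

```python
def split_fractions(test_frac: float = 0.20, val_frac: float = 0.10):
    """Effective train/val/test shares when validation is carved out of the non-test part."""
    remaining = 1.0 - test_frac
    return remaining * (1.0 - val_frac), remaining * val_frac, test_frac


def smote_synthetic_count(n_majority: int, n_minority: int, ratio: float = 0.8) -> int:
    """Synthetic minority samples needed so that minority/majority reaches `ratio`."""
    needed = int(n_majority * ratio - n_minority)
    if needed <= 0:
        raise ValueError("minority class already meets or exceeds the requested ratio")
    return needed


train, val, test = split_fractions()
print(f"train={train:.2f}, val={val:.2f}, test={test:.2f}")

# Hypothetical training fold: 100 clean modules, 30 defective ones.
extra = smote_synthetic_count(n_majority=100, n_minority=30, ratio=0.8)
print(f"synthetic samples to add: {extra}")
```

Only the training fold is resampled, so the validation and test folds keep the original class balance, and the scaler's minimum and maximum come from training data alone.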


In [None]:
!pip install scipy scikit-learn imbalanced-learn pandas numpy torch seaborn matplotlib openpyxl -q

In [None]:
from google.colab import drive

drive.mount('/content/drive')

import os
import numpy as np
import pandas as pd
from scipy.io import arff
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from imblearn.over_sampling import SMOTE

DATASET_NAME = "MC2"
base_path = "/content/drive/MyDrive/nasa_datasets/"  # <- update this to match your Drive path
file_path = os.path.join(base_path, f"{DATASET_NAME}.arff")


def load_arff_data(file_path):
    try:
        data, meta = arff.loadarff(file_path)
        df = pd.DataFrame(data)
        for col in df.select_dtypes([object]).columns:
            try:
                df[col] = df[col].str.decode('utf-8')
            except Exception:
                pass
        return df
    except Exception:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            content = f.read()
        data_start = content.lower().find('@data')
        if data_start == -1:
            raise ValueError("No @data section found in the ARFF file.")
        data_section = content[data_start + 5:].strip()
        df = pd.read_csv(StringIO(data_section), header=None)
        return df


df = load_arff_data(file_path)
print(f"{DATASET_NAME} dataset loaded. Number of samples: {len(df)}")

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
if y.dtype == object or y.dtype == 'bool' or np.issubdtype(y.dtype, np.str_):
    y = LabelEncoder().fit_transform(y)

X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.10, stratify=y_full_train, random_state=42
)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

smote_ratio = 0.8
sm = SMOTE(sampling_strategy=smote_ratio, random_state=42)
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)
print(f"Training class distribution after SMOTE: {np.bincount(y_train_smote)}")


## 2) Model 1: Random Forest (Baseline)
The threshold that maximizes F1 on the validation set is selected, then the Train/Test metrics are reported.
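
The threshold selection can be sketched as a small reusable function. This is a dependency-free illustration with made-up scores; `f1_from_labels` and `best_threshold` are hypothetical names, and the notebook's own cells use `f1_score` with `np.arange` instead. The sweep is indexed by an integer step so that floating-point error does not accumulate across thresholds.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for binary labels, returning 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, lo=0.10, hi=0.90, step=0.01):
    """Sweep thresholds in [lo, hi] and keep the first one with the highest F1."""
    best_t, best_f1 = 0.5, 0.0
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        t = lo + i * step
        preds = [int(s >= t) for s in scores]
        f1 = f1_from_labels(y_true, preds)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold among ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up validation labels and predicted probabilities.
y_val_demo = [0, 0, 0, 0, 1, 0, 1, 1]
scores_demo = [0.052, 0.204, 0.307, 0.452, 0.405, 0.603, 0.706, 0.903]

t, f1 = best_threshold(y_val_demo, scores_demo)
print(f"best threshold={t:.2f}, F1={f1:.3f}")
```

Choosing the threshold on the validation fold keeps the test fold untouched until the final report.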


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_smote, y_train_smote)

y_val_proba = rf.predict_proba(X_val)[:, 1]
best_thresh = 0.5
best_f1 = 0.0
for thresh in np.arange(0.1, 0.91, 0.01):
    y_val_pred = (y_val_proba >= thresh).astype(int)
    f1 = f1_score(y_val, y_val_pred)
    if f1 > best_f1:
        best_f1 = f1
        best_thresh = thresh

print(f"RF optimal threshold: {best_thresh:.2f} (F1={best_f1:.3f})")

y_train_pred_rf = (rf.predict_proba(X_train)[:, 1] >= best_thresh).astype(int)
y_test_pred_rf = (rf.predict_proba(X_test)[:, 1] >= best_thresh).astype(int)

metrics_rf = {
    'Train': {
        'Accuracy': accuracy_score(y_train, y_train_pred_rf),
        'Precision': precision_score(y_train, y_train_pred_rf, zero_division=0),
        'Recall': recall_score(y_train, y_train_pred_rf),
        'F1': f1_score(y_train, y_train_pred_rf),
        'F2': fbeta_score(y_train, y_train_pred_rf, beta=2),
    },
    'Test': {
        'Accuracy': accuracy_score(y_test, y_test_pred_rf),
        'Precision': precision_score(y_test, y_test_pred_rf, zero_division=0),
        'Recall': recall_score(y_test, y_test_pred_rf),
        'F1': f1_score(y_test, y_test_pred_rf),
        'F2': fbeta_score(y_test, y_test_pred_rf, beta=2),
    },
}

print("Random Forest Results:")
for phase in ['Train', 'Test']:
    m = metrics_rf[phase]
    print(
        f" {phase} -> Accuracy: {m['Accuracy']:.3f}, Precision: {m['Precision']:.3f}, "
        f"Recall: {m['Recall']:.3f}, F1: {m['F1']:.3f}, F2: {m['F2']:.3f}"
    )


## 3) Model 2: KAN (Kolmogorov–Arnold Network)
KAN provides a flexible representation for tabular data through spline-based transformation layers. It is trained with Focal Loss and early stopping.
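
Two ingredients of this section can be checked by hand on scalars. The sketch below mirrors, in plain Python, the hat-shaped basis `relu(1 - |x - g|)` and the focal-loss expression used by the `KANLinear` and `FocalLoss` classes in this section. The function names are hypothetical, and the grid is only the initial `linspace(-1, 1, 5)` value, since the layer treats its grid as a learnable parameter.

```python
import math


def hat_basis(x: float, grid):
    """Piecewise-linear basis value max(0, 1 - |x - g|) for each grid point g."""
    return [max(0.0, 1.0 - abs(x - g)) for g in grid]


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Focal loss for one predicted probability p in (0, 1) and a binary label y."""
    bce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    pt = math.exp(-bce)  # probability the model assigned to the true class
    return alpha * (1 - pt) ** gamma * bce


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]  # initial grid for grid_size=5
print(hat_basis(0.25, grid))

# A confident correct prediction is down-weighted far more than an uncertain one.
print(focal_loss(0.9, 1), focal_loss(0.6, 1))
```

The `(1 - pt) ** gamma` factor is what shifts the training signal toward hard examples, which is the usual motivation for focal loss on imbalanced data.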


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader


class KANLinear(nn.Module):
    def __init__(self, in_features, out_features, grid_size=5, spline_order=3):
        super().__init__()
        self.grid_size = grid_size
        self.spline_order = spline_order
        self.grid = nn.Parameter(
            torch.linspace(-1, 1, grid_size)
            .unsqueeze(0)
            .unsqueeze(0)
            .repeat(out_features, in_features, 1)
        )
        self.coef = nn.Parameter(torch.randn(out_features, in_features, grid_size + spline_order) * 0.1)
        self.base_weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)

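    def forward(self, x):
        # x: (batch, in) -> (batch, 1, in, 1) so it broadcasts against the grid
        x_expanded = x.unsqueeze(1).unsqueeze(-1)
        grid = self.grid.unsqueeze(0)  # (1, out, in, grid_size)
        distances = torch.abs(x_expanded - grid)
        basis = torch.relu(1 - distances)  # hat-shaped basis, (batch, out, in, grid_size)
        # The coefficient slice is 3-D (out, in, grid_size), so its subscripts are 'oij', not 'boij'
        spline_out = torch.einsum('boij,oij->boi', basis, self.coef[..., : self.grid_size])
        spline_out = spline_out.sum(dim=-1)  # sum over input features -> (batch, out)
        linear_out = torch.matmul(x, self.base_weight.t())
        return linear_out + spline_out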


class KANModel(nn.Module):
    def __init__(self, input_dim, hidden_dim=64, grid_size=5, spline_order=3):
        super().__init__()
        self.kan1 = KANLinear(input_dim, hidden_dim, grid_size, spline_order)
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.kan2 = KANLinear(hidden_dim, hidden_dim // 2, grid_size, spline_order)
        self.bn2 = nn.BatchNorm1d(hidden_dim // 2)
        self.fc_out = nn.Linear(hidden_dim // 2, 1)

    def forward(self, x):
        x = torch.relu(self.bn1(self.kan1(x)))
        x = torch.relu(self.bn2(self.kan2(x)))
        return torch.sigmoid(self.fc_out(x))


class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, inputs, targets):
        bce_loss = nn.functional.binary_cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-bce_loss)
        focal_loss = self.alpha * ((1 - pt) ** self.gamma) * bce_loss
        return focal_loss.mean()


def train_model(model, X_train, y_train, X_val, y_val, epochs=50, batch_size=32, lr=0.01):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    train_ds = TensorDataset(
        torch.tensor(X_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32)
    )
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    criterion = FocalLoss(alpha=0.25, gamma=2.0)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    best_model_state = None
    best_f1 = 0.0
    epochs_no_improve = 0
    for _ in range(epochs):
        model.train()
        for X_batch, y_batch in train_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            optimizer.zero_grad()
            y_pred = model(X_batch).view(-1)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            X_val_t = torch.tensor(X_val, dtype=torch.float32).to(device)
            val_preds = model(X_val_t).view(-1).cpu().numpy()
        val_pred_labels = (val_preds >= 0.5).astype(int)
        val_f1 = f1_score(y_val, val_pred_labels)
        if val_f1 > best_f1:
            best_f1 = val_f1
            epochs_no_improve = 0
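            # Clone the tensors: state_dict() returns references that later optimizer steps overwrite
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}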
        else:
            epochs_no_improve += 1
        if epochs_no_improve >= 5:
            break

    if best_model_state is not None:
        model.load_state_dict(best_model_state)
    # Move back to the CPU: the evaluation cells feed CPU tensors to the model
    return model.cpu()


kan_model = KANModel(input_dim=X_train_smote.shape[1], hidden_dim=64, grid_size=5, spline_order=3)
kan_model = train_model(kan_model, X_train_smote, y_train_smote, X_val, y_val)

kan_model.eval()
with torch.no_grad():
    val_probs = kan_model(torch.tensor(X_val, dtype=torch.float32)).view(-1).numpy()

best_thresh = 0.5
best_f1 = 0.0
for t in np.arange(0.1, 0.91, 0.01):
    f1 = f1_score(y_val, (val_probs >= t).astype(int))
    if f1 > best_f1:
        best_f1 = f1
        best_thresh = t

print(f"KAN best threshold: {best_thresh:.2f} (F1={best_f1:.3f})")

with torch.no_grad():
    train_probs_kan = kan_model(torch.tensor(X_train, dtype=torch.float32)).view(-1).numpy()
    test_probs_kan = kan_model(torch.tensor(X_test, dtype=torch.float32)).view(-1).numpy()

y_train_pred_kan = (train_probs_kan >= best_thresh).astype(int)
y_test_pred_kan = (test_probs_kan >= best_thresh).astype(int)

print("KAN Results:")
print(
    f" Train -> Accuracy: {accuracy_score(y_train, y_train_pred_kan):.3f}, "
    f"Precision: {precision_score(y_train, y_train_pred_kan, zero_division=0):.3f}, "
    f"Recall: {recall_score(y_train, y_train_pred_kan):.3f}, "
    f"F1: {f1_score(y_train, y_train_pred_kan):.3f}"
)
print(
    f" Test  -> Accuracy: {accuracy_score(y_test, y_test_pred_kan):.3f}, "
    f"Precision: {precision_score(y_test, y_test_pred_kan, zero_division=0):.3f}, "
    f"Recall: {recall_score(y_test, y_test_pred_kan):.3f}, "
    f"F1: {f1_score(y_test, y_test_pred_kan):.3f}, "
    f"F2: {fbeta_score(y_test, y_test_pred_kan, beta=2):.3f}"
)


## 4) Model 3: Attention-KAN
An attention mechanism is applied to the input features. The aim is to increase Recall while preserving Accuracy by focusing on the critical features.
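
Written out, the feature gate computed by the `AttentionKAN` class in this section is the following. The symbols are notation introduced here for illustration: $W_1, b_1$ and $W_2, b_2$ stand for the weights and biases of `att_fc1` and `att_fc2`, $\sigma$ is the logistic sigmoid, and $\odot$ is the element-wise product.

$$
a = \sigma\big(W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2\big), \qquad x_{\text{att}} = x \odot a
$$

Each component $a_j$ lies in $(0, 1)$, so the gate rescales every feature independently rather than normalizing the weights across features as a softmax would. The gated vector $x_{\text{att}}$ is then passed to the same two KAN layers used by the plain KAN model.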


In [None]:
class AttentionKAN(nn.Module):
    def __init__(self, input_dim, hidden_dim=64, grid_size=5, spline_order=3):
        super().__init__()
        self.att_fc1 = nn.Linear(input_dim, input_dim)
        self.att_fc2 = nn.Linear(input_dim, input_dim)
        self.kan1 = KANLinear(input_dim, hidden_dim, grid_size, spline_order)
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.kan2 = KANLinear(hidden_dim, hidden_dim // 2, grid_size, spline_order)
        self.bn2 = nn.BatchNorm1d(hidden_dim // 2)
        self.fc_out = nn.Linear(hidden_dim // 2, 1)

    def forward(self, x):
        att_scores = torch.relu(self.att_fc1(x))
        att_scores = torch.sigmoid(self.att_fc2(att_scores))
        x_att = x * att_scores
        x = torch.relu(self.bn1(self.kan1(x_att)))
        x = torch.relu(self.bn2(self.kan2(x)))
        return torch.sigmoid(self.fc_out(x))


att_kan_model = AttentionKAN(input_dim=X_train_smote.shape[1], hidden_dim=64, grid_size=5, spline_order=3)
att_kan_model = train_model(att_kan_model, X_train_smote, y_train_smote, X_val, y_val)

att_kan_model.eval()
with torch.no_grad():
    val_probs = att_kan_model(torch.tensor(X_val, dtype=torch.float32)).view(-1).numpy()

best_thresh = 0.5
best_f1 = 0.0
for t in np.arange(0.1, 0.91, 0.01):
    f1 = f1_score(y_val, (val_probs >= t).astype(int))
    if f1 > best_f1:
        best_f1 = f1
        best_thresh = t

print(f"Attention-KAN best threshold: {best_thresh:.2f} (F1={best_f1:.3f})")

with torch.no_grad():
    train_probs_att = att_kan_model(torch.tensor(X_train, dtype=torch.float32)).view(-1).numpy()
    test_probs_att = att_kan_model(torch.tensor(X_test, dtype=torch.float32)).view(-1).numpy()

y_train_pred_att = (train_probs_att >= best_thresh).astype(int)
y_test_pred_att = (test_probs_att >= best_thresh).astype(int)

print("Attention-KAN Results:")
print(
    f" Train -> Accuracy: {accuracy_score(y_train, y_train_pred_att):.3f}, "
    f"Precision: {precision_score(y_train, y_train_pred_att, zero_division=0):.3f}, "
    f"Recall: {recall_score(y_train, y_train_pred_att):.3f}, "
    f"F1: {f1_score(y_train, y_train_pred_att):.3f}"
)
print(
    f" Test  -> Accuracy: {accuracy_score(y_test, y_test_pred_att):.3f}, "
    f"Precision: {precision_score(y_test, y_test_pred_att, zero_division=0):.3f}, "
    f"Recall: {recall_score(y_test, y_test_pred_att):.3f}, "
    f"F1: {f1_score(y_test, y_test_pred_att):.3f}, "
    f"F2: {fbeta_score(y_test, y_test_pred_att, beta=2):.3f}"
)


## 5) Comparison
Below, the test-set metrics are reported as **Accuracy, Precision, Recall, F1, F2**. You can compare and interpret the results for this dataset.


In [None]:
import pandas as pd

summary = pd.DataFrame(
    [
        {
            'Model': 'Random Forest',
            **metrics_rf['Test'],
        },
        {
            'Model': 'KAN',
            'Accuracy': accuracy_score(y_test, y_test_pred_kan),
            'Precision': precision_score(y_test, y_test_pred_kan, zero_division=0),
            'Recall': recall_score(y_test, y_test_pred_kan),
            'F1': f1_score(y_test, y_test_pred_kan),
            'F2': fbeta_score(y_test, y_test_pred_kan, beta=2),
        },
        {
            'Model': 'Attention-KAN',
            'Accuracy': accuracy_score(y_test, y_test_pred_att),
            'Precision': precision_score(y_test, y_test_pred_att, zero_division=0),
            'Recall': recall_score(y_test, y_test_pred_att),
            'F1': f1_score(y_test, y_test_pred_att),
            'F2': fbeta_score(y_test, y_test_pred_att, beta=2),
        },
    ]
)
summary
