# Part 4 - Tabular Data (MiniBooNE Particle Identification)
Neural networks, especially CNNs, work extremely well on images. On tabular data, however, they are often less effective (see [this paper](https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html)). Tabular data consists of feature columns that can hold different data types, such as numerical or categorical values.

In this task, the goal is to build a well-performing classifier on the tabular [MiniBooNE dataset](https://archive.ics.uci.edu/dataset/199/miniboone+particle+identification). You can reuse all your code from part 3.

In [1]:
import urllib.request
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

import pandas as pd
from scipy.io import arff


# Download the dataset
url = "https://www.openml.org/data/download/19335523/MiniBooNE.arff"
miniboone_path = Path("raw_miniboone/MiniBooNE.arff")
if not miniboone_path.exists():
    miniboone_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, miniboone_path)

# Load the dataset
miniboone_arff = arff.loadarff(miniboone_path)[0]  # loadarff returns (data, meta); keep only the data
miniboone_df = pd.DataFrame(miniboone_arff)
miniboone_df.head()

Unnamed: 0,signal,ParticleID_0,ParticleID_1,ParticleID_2,ParticleID_3,ParticleID_4,ParticleID_5,ParticleID_6,ParticleID_7,ParticleID_8,...,ParticleID_40,ParticleID_41,ParticleID_42,ParticleID_43,ParticleID_44,ParticleID_45,ParticleID_46,ParticleID_47,ParticleID_48,ParticleID_49
0,b'True',2.59413,0.468803,20.6916,0.322648,0.009682,0.374393,0.803479,0.896592,3.59665,...,101.174,-31.373,0.442259,5.86453,0.0,0.090519,0.176909,0.457585,0.071769,0.245996
1,b'True',3.86388,0.645781,18.1375,0.233529,0.030733,0.361239,1.06974,0.878714,3.59243,...,186.516,45.9597,-0.478507,6.11126,0.001182,0.0918,-0.465572,0.935523,0.333613,0.230621
2,b'True',3.38584,1.19714,36.0807,0.200866,0.017341,0.260841,1.10895,0.884405,3.43159,...,129.931,-11.5608,-0.297008,8.27204,0.003854,0.141721,-0.210559,1.01345,0.255512,0.180901
3,b'True',4.28524,0.510155,674.201,0.281923,0.009174,0.0,0.998822,0.82339,3.16382,...,163.978,-18.4586,0.453886,2.48112,0.0,0.180938,0.407968,4.34127,0.473081,0.25899
4,b'True',5.93662,0.832993,59.8796,0.232853,0.025066,0.233556,1.37004,0.787424,3.66546,...,229.555,42.96,-0.975752,2.66109,0.0,0.170836,-0.814403,4.67949,1.92499,0.253893


In [None]:
# Convert to numpy data and labels
X_unfiltered = miniboone_df.drop(columns="signal").to_numpy()  # drop the label column; shape (n_samples, n_features)
y_unfiltered = (miniboone_df["signal"] == b'True').to_numpy().astype(int)  # binary labels: 1 if signal is b'True', 0 otherwise
print(f"{X_unfiltered.shape=}, {y_unfiltered.shape=}")

print(X_unfiltered)
print(y_unfiltered)

X_unfiltered.shape=(130064, 50), y_unfiltered.shape=(130064,)
[[2.59413e+00 4.68803e-01 2.06916e+01 ... 4.57585e-01 7.17692e-02
  2.45996e-01]
 [3.86388e+00 6.45781e-01 1.81375e+01 ... 9.35523e-01 3.33613e-01
  2.30621e-01]
 [3.38584e+00 1.19714e+00 3.60807e+01 ... 1.01345e+00 2.55512e-01
  1.80901e-01]
 ...
 [3.10842e+00 2.17814e+00 5.63651e+01 ... 7.89276e-01 7.30342e-01
  1.52876e-01]
 [5.44560e+00 1.84570e+00 1.03463e+02 ... 2.87259e+00 8.19867e-01
  2.10619e-01]
 [4.55062e+00 1.34174e+00 8.00887e+01 ... 2.64744e+00 7.42709e-01
  2.76477e-01]]
[1 1 1 ... 0 0 0]


## 4.1 Familiarize Yourself with the Data
Read the dataset description and analyze the data.
- Identify and filter out invalid samples: Locate all the rows where all features (columns) have the value -999. Remove them.
- Split the filtered data into training and testing sets. Use 20% of the data for testing and set the random_state to 42. Use the [train_test_split function from scikit-learn](https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).
- Which normalization method would you use for this dataset? Why?
- Which metrics would you use to evaluate the model?

`test_size` is the proportion of the dataset to include in the test split. `random_state` seeds the shuffling applied to the data before the split, which makes the split reproducible.
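The filtering and splitting steps listed above can be sketched on a small toy array. The array `X_toy`, the labels `y_toy`, and the `stratify` argument are assumptions made for illustration and are not part of the exercise; the point is that a row counts as invalid only when every feature equals -999.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 3 features
X_toy = np.arange(30, dtype=float).reshape(10, 3)
X_toy[[2, 7]] = -999   # rows 2 and 7 are invalid in every feature
X_toy[4, 0] = -999     # only one entry is -999, so this row is kept
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# A row is invalid only if all of its features equal -999
invalid_mask = (X_toy == -999).all(axis=1)
X_valid, y_valid = X_toy[~invalid_mask], y_toy[~invalid_mask]

# stratify (optional, an assumption here) keeps the class ratio similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_valid, y_valid, test_size=0.25, random_state=42, stratify=y_valid
)
print(invalid_mask.sum(), X_valid.shape, X_tr.shape, X_te.shape)
```

Using `.any(axis=1)` instead of `.all(axis=1)` would also drop row 4, which is a stricter filter than the task asks for.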

In [None]:
# identify and filter out invalid rows
valid_mask = ~(X_unfiltered == -999).all(axis=1)
X_filtered = X_unfiltered[valid_mask]
y_filtered = y_unfiltered[valid_mask]

print(X_filtered.shape)

# split the filtered data
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_filtered, y_filtered, test_size=0.2, random_state=42)

(129596, 50)


In [14]:
# Standardization (zero mean, unit variance per feature)
from sklearn.preprocessing import StandardScaler
scaler1 = StandardScaler()
X_train_std_scaled = scaler1.fit_transform(X_train_raw)
X_test_std_scaled = scaler1.transform(X_test_raw)

# Min-max Normalisation
from sklearn.preprocessing import MinMaxScaler
scaler2 = MinMaxScaler()
X_train_minmax_scaled = scaler2.fit_transform(X_train_raw)  # fit on the training data only
X_test_minmax_scaled = scaler2.transform(X_test_raw)

The choice of metrics depends on whether the classes are balanced:
- Balanced labels: accuracy, F1 score, confusion matrix.
- Imbalanced labels (common in physics): precision and recall, F1 score, ROC AUC (the area under the ROC curve, which plots the true positive rate against the false positive rate).
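As a minimal sketch of how these metrics are computed with scikit-learn, the following uses a small set of made-up labels and scores (`y_true` and `y_score` are hypothetical, not taken from MiniBooNE). Note that ROC AUC is computed from continuous scores, while accuracy, F1, and the confusion matrix need thresholded predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# Hypothetical imbalanced labels (7 negatives, 3 positives) and positive-class scores
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.2, 0.6, 0.1, 0.9, 0.4, 0.8, 0.3])
y_pred = (y_score >= 0.5).astype(int)      # threshold the scores at 0.5

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)              # F1 for the positive class (label 1)
auc = roc_auc_score(y_true, y_score)       # uses the scores, not the thresholded labels
cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class

print("class counts:", np.bincount(y_true))
print(f"accuracy={acc:.3f}, F1={f1:.3f}, ROC AUC={auc:.3f}")
print(cm)
```

On this toy example the accuracy (0.8) looks better than the F1 score for the minority class (about 0.67), which is why accuracy alone can be misleading on imbalanced labels.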

## 4.2 Classical Machine Learning
- Try to achieve a high classification accuracy using non-neural network methods. You should aim for at least 0.936 accuracy. Discuss the results.

In [15]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

In [16]:
rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train_std_scaled, y_train)

In [17]:
prediction_rf = rf.predict(X_test_std_scaled)
print(classification_report(y_test, prediction_rf))

              precision    recall  f1-score   support

           0       0.95      0.96      0.96     18611
           1       0.90      0.88      0.89      7309

    accuracy                           0.94     25920
   macro avg       0.92      0.92      0.92     25920
weighted avg       0.94      0.94      0.94     25920



In [18]:
gb = GradientBoostingClassifier()
gb.fit(X_train_std_scaled, y_train)

In [19]:
prediction_gb = gb.predict(X_test_std_scaled)
print(classification_report(y_test, prediction_gb))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95     18611
           1       0.87      0.87      0.87      7309

    accuracy                           0.93     25920
   macro avg       0.91      0.91      0.91     25920
weighted avg       0.93      0.93      0.93     25920



In [21]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train_std_scaled, y_train)
prediction_knn = knn.predict(X_test_std_scaled)
print(classification_report(y_test, prediction_knn))

              precision    recall  f1-score   support

           0       0.96      0.94      0.95     18611
           1       0.85      0.90      0.87      7309

    accuracy                           0.92     25920
   macro avg       0.90      0.92      0.91     25920
weighted avg       0.93      0.92      0.93     25920



Random Forest:
Best overall performance (accuracy 0.94) and the highest F1 score on both the majority class (0.96 for class 0) and the minority class (0.89 for class 1). It is likely the most robust choice without much tuning.

Gradient Boosting:
Very strong and only slightly behind Random Forest (accuracy 0.93) with default hyperparameters. It may outperform Random Forest after tuning, but it is slower to train.

KNN (k=3):
Slightly lower accuracy (0.92), but the highest recall for class 1 (0.90), meaning it catches more minority-class samples, at the cost of lower precision (0.85). It is simpler, but slower to predict and sensitive to feature scaling.
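Because `classification_report` rounds to two decimals, a reported accuracy of 0.94 does not show whether a target such as 0.936 is actually met. A minimal sketch of checking the unrounded value with `accuracy_score` follows; it runs on synthetic data from `make_classification`, and all variable names and hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a roughly 70/30 class split (not MiniBooNE)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, weights=[0.7, 0.3], random_state=42
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Fixing random_state makes the forest, and therefore the score, reproducible
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_a, y_a)

acc_syn = accuracy_score(y_b, clf.predict(X_b))
target = 0.936
print(f"accuracy = {acc_syn:.4f}, meets target {target}: {acc_syn >= target}")
```

In the notebook, the same check would read `accuracy_score(y_test, prediction_rf)`.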

## 4.3 Neural Networks
- Try to get a high classification accuracy using neural networks.
- See how your neural networks perform compared to your previously used methods.

In [22]:
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

In [36]:
# Define datasets and loaders
batch_size = 40

# Convert to tensors
X_train_tensor = torch.tensor(X_train_std_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test_std_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)                          # torch dataset
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)  
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)    # torch data loaders
test_loader = DataLoader(test_dataset, batch_size=batch_size)

print(train_dataset)

<torch.utils.data.dataset.TensorDataset object at 0x0000023A624ABDA0>


In [37]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()        # call the constructor method of parent class
        self.flatten = nn.Flatten()
        self.relu = nn.ReLU()                  # activation function f(x) = max(0, x)
        self.fc1 = nn.Linear(50, 128)     # define 3 layers
        self.fc2 = nn.Linear(128, 10)
        self.fc3 = nn.Linear(10, 2)

    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

In [40]:
model = MLP()   # constructor
# Hyperparameters
learning_rate = 0.001
num_epochs = 10

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  # plain stochastic gradient descent (no momentum is set)

# training loop
model.train()                                                    # enables training-specific behaviour
loss_list=[]
for epoch in range(num_epochs):
    for batch_idx, (data, label) in enumerate(train_loader):     # batch_idx index of current batch
        # loss
        output = model(data)                                   # forward pass for the current batch
        loss = criterion(output, label)                        # compute the cross-entropy loss

        # backprop
        loss.backward()                                         # compute gradients via backpropagation (chain rule)
        optimizer.step()                                        # update model parameters
        optimizer.zero_grad()                                   # clear gradients before the next batch

        if (batch_idx+1) % 1250 == 0:                            # print and record the loss every 1250 batches
            print(f"Epoch {epoch+1}, Batch {batch_idx+1}, Loss: {loss.item()}")
            loss_list.append(loss.item())

Epoch 1, Batch 1250, Loss: 0.6430441737174988
Epoch 1, Batch 2500, Loss: 0.47522133588790894
Epoch 2, Batch 1250, Loss: 0.38894912600517273
Epoch 2, Batch 2500, Loss: 0.3213590681552887
Epoch 3, Batch 1250, Loss: 0.2849302291870117
Epoch 3, Batch 2500, Loss: 0.4530360698699951
Epoch 4, Batch 1250, Loss: 0.26232102513313293
Epoch 4, Batch 2500, Loss: 0.2453579604625702
Epoch 5, Batch 1250, Loss: 0.11809174716472626
Epoch 5, Batch 2500, Loss: 0.2945480942726135
Epoch 6, Batch 1250, Loss: 0.12929317355155945
Epoch 6, Batch 2500, Loss: 0.18562369048595428
Epoch 7, Batch 1250, Loss: 0.1690026819705963
Epoch 7, Batch 2500, Loss: 0.25737372040748596
Epoch 8, Batch 1250, Loss: 0.07622580230236053
Epoch 8, Batch 2500, Loss: 0.22188958525657654
Epoch 9, Batch 1250, Loss: 0.1687050759792328
Epoch 9, Batch 2500, Loss: 0.10548646748065948
Epoch 10, Batch 1250, Loss: 0.1083785891532898
Epoch 10, Batch 2500, Loss: 0.2078717201948166


In [None]:
num_correct = 0
test_loss = 0
num_samples = len(test_dataset)
num_batches = len(test_loader)

# evaluation
model.eval()                                        # switch to evaluation-mode
with torch.no_grad():
    for eval_data, eval_label in test_loader:
        prediction = model(eval_data)
        test_loss += criterion(prediction, eval_label).item()               # accumulate as a Python float
        num_correct += (prediction.argmax(1) == eval_label).sum().item()    # count correct predictions
test_loss /= num_batches      # average of the per-batch mean losses
print(f"Test error: \nAccuracy: {num_correct/num_samples*100:.2f}%\nAverage loss: {test_loss}")

Test error: 
Accuracy: 92.23%
Average loss: 0.19912411272525787


The MLP reaches a slightly lower accuracy (92.23%) than the best classical method (Random Forest, 0.94). However, its run time is much shorter (about 20 s instead of more than 3 min). In practice, the choice between the classical methods and the MLP therefore depends on what matters more: accuracy or run time.
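Run-time comparisons like the one above are easier to reproduce when fitting and prediction are timed explicitly. The following is a minimal sketch using `time.perf_counter` on synthetic data; the helper `timed`, the data, and the model settings are hypothetical, and the absolute numbers depend on the machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Synthetic stand-in data (not MiniBooNE)
X_t, y_t = make_classification(n_samples=2000, n_features=20, random_state=0)

estimators = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

timings = {}
for name, est in estimators.items():
    _, fit_s = timed(est.fit, X_t, y_t)      # training time
    _, pred_s = timed(est.predict, X_t)      # prediction time
    timings[name] = (fit_s, pred_s)
    print(f"{name}: fit {fit_s:.3f} s, predict {pred_s:.3f} s")
```

Timing fit and predict separately also makes it visible where a method such as KNN spends its time, since it does little work during fitting and most of it during prediction.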
