## Empirical Tuning

### Round 1

In [102]:
import numpy as np
import tensorflow as tf
from sklearn.metrics import (
    confusion_matrix, classification_report, roc_auc_score,
    log_loss, balanced_accuracy_score, cohen_kappa_score,
    matthews_corrcoef, f1_score, recall_score, precision_score
)
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import BatchNormalization

In [39]:
print("\n=== Round 1: Baseline CNN ===")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=input_shape),
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2,2)),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2,2)),
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')])


=== Round 1: Baseline CNN ===


In [41]:
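# NOTE: sparse_top_k_categorical_accuracy defaults to k=5, so with five classes
# this metric is always 1.0; it is not a top-3 accuracy.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy', tf.keras.metrics.sparse_top_k_categorical_accuracy])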

In [43]:
history = model.fit(X_train, y_train, epochs=10, validation_split=0.2, batch_size=32, verbose=2)

Epoch 1/10
187/187 - 7s - 39ms/step - accuracy: 0.5634 - loss: 1.1020 - sparse_top_k_categorical_accuracy: 1.0000 - val_accuracy: 0.5964 - val_loss: 1.0377 - val_sparse_top_k_categorical_accuracy: 1.0000
Epoch 2/10
187/187 - 4s - 23ms/step - accuracy: 0.6518 - loss: 0.8982 - sparse_top_k_categorical_accuracy: 1.0000 - val_accuracy: 0.6539 - val_loss: 0.9106 - val_sparse_top_k_categorical_accuracy: 1.0000
Epoch 3/10
187/187 - 5s - 24ms/step - accuracy: 0.6976 - loss: 0.7964 - sparse_top_k_categorical_accuracy: 1.0000 - val_accuracy: 0.7008 - val_loss: 0.8244 - val_sparse_top_k_categorical_accuracy: 1.0000
Epoch 4/10
187/187 - 4s - 23ms/step - accuracy: 0.7504 - loss: 0.6719 - sparse_top_k_categorical_accuracy: 1.0000 - val_accuracy: 0.7028 - val_loss: 0.8430 - val_sparse_top_k_categorical_accuracy: 1.0000
Epoch 5/10
187/187 - 4s - 23ms/step - accuracy: 0.7929 - loss: 0.5442 - sparse_top_k_categorical_accuracy: 1.0000 - val_accuracy: 0.7135 - val_loss: 0.8389 - val_sparse_top_k_categoric

In [44]:
test_loss, test_acc, test_top3_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Round 1 Test Accuracy: {test_acc:.4f}")
print(f"Round 1 Test Top-3 Accuracy: {test_top3_acc:.4f}")

Round 1 Test Accuracy: 0.6872
Round 1 Test Top-3 Accuracy: 1.0000


## 🔍 Observations

### 🧪 Model Performance
- **Test Accuracy**: `68.7%` – Moderate overall performance.
- **Top-3 Accuracy**: reported as `100%`, but the figure is not meaningful. `sparse_top_k_categorical_accuracy` defaults to `k=5`, and with five classes the true label is always among the top 5.
---
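Why a top-k metric with `k` equal to the number of classes is uninformative can be checked with a small NumPy sketch. The `top_k_accuracy` helper and the random data below are illustrative assumptions, not part of the notebook; they only mimic what a top-k metric measures.

```python
import numpy as np

def top_k_accuracy(y_true, probs, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # Indices of the k largest scores per row (order inside the top k does not matter).
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(y_true, top_k)]))

rng = np.random.default_rng(0)
num_classes = 5  # matches the 5x5 confusion matrix in this section
probs = rng.dirichlet(np.ones(num_classes), size=200)  # random "predictions"
y_true = rng.integers(0, num_classes, size=200)        # random labels

top5 = top_k_accuracy(y_true, probs, k=5)  # always 1.0: every class is in the top 5
top3 = top_k_accuracy(y_true, probs, k=3)  # below 1.0 for random scores
print(top5, top3)
```

Even purely random scores reach 1.0 at `k=5`, so only the `k=3` value says anything about ranking quality.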

In [45]:
y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)

59/59 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step


In [46]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[317  65  26 110   0]
 [ 67  47   4  57   0]
 [ 29  13  45  57   1]
 [ 90  29  17 874   0]
 [  7   1   2   9   0]]


### 🧾 Confusion Matrix Insights
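- Most errors involve the two largest classes: **'inside'** and **'food'** are confused with each other (110 and 90 samples), and **'outside'** and **'drink'** are mostly absorbed into them.
- **'Food'** has many correct predictions (874) but also attracts 233 false positives from other classes.
- **'Menu'** is never predicted correctly.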
---
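The per-class picture can be read directly off the Round 1 confusion matrix. The following is a minimal sketch, assuming rows are true classes and columns are predictions (the scikit-learn convention) and that the label order matches the classification report.

```python
import numpy as np

labels = ["inside", "outside", "drink", "food", "menu"]
cm = np.array([
    [317,  65,  26, 110, 0],
    [ 67,  47,   4,  57, 0],
    [ 29,  13,  45,  57, 1],
    [ 90,  29,  17, 874, 0],
    [  7,   1,   2,   9, 0],
])

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)  # rows are true classes
predicted = cm.sum(axis=0)          # columns are predicted classes
# Guard against classes that are never predicted at all.
precision = np.divide(true_pos, predicted,
                      out=np.zeros(len(labels)), where=predicted > 0)

for name, r, p in zip(labels, recall, precision):
    print(f"{name:8s} recall={r:.2f} precision={p:.2f}")

# Largest off-diagonal entry per row: where each true class most often ends up.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
most_confused = [labels[j] for j in off_diag.argmax(axis=1)]
for name, target in zip(labels, most_confused):
    print(f"{name:8s} is most often mistaken for {target}")
```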

In [47]:
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=list(label_dict.keys())))

Classification Report:
              precision    recall  f1-score   support

      inside       0.62      0.61      0.62       518
     outside       0.30      0.27      0.28       175
       drink       0.48      0.31      0.38       145
        food       0.79      0.87      0.83      1010
        menu       0.00      0.00      0.00        19

    accuracy                           0.69      1867
   macro avg       0.44      0.41      0.42      1867
weighted avg       0.67      0.69      0.67      1867



### 📉 Class-wise Performance
- **'Food' class** performs best with an F1-score of `0.83`, thanks to high support (sample count).
- **'Menu' class** has extremely poor performance (`F1-score: 0.00`) – likely due to very low support (only 19 samples).
- **'Outside'** and **'drink'** classes have low precision, recall, and F1-scores – indicating frequent misclassifications.
---

In [53]:
y_test_cat = to_categorical(y_test, num_classes=num_classes)

In [55]:
print("\nAUC Scores:")
auc_scores = []
for i in range(num_classes):
    try:
        auc = roc_auc_score(y_test_cat[:, i], y_pred_probs[:, i])
        auc_scores.append(auc)
        print(f"Class {list(label_dict.keys())[i]} AUC: {auc:.3f}")
    except ValueError:  # roc_auc_score raises this if only one class is present
        auc_scores.append(None)
        print(f"Class {list(label_dict.keys())[i]} AUC: N/A")


AUC Scores:
Class inside AUC: 0.851
Class outside AUC: 0.785
Class drink AUC: 0.768
Class food AUC: 0.893
Class menu AUC: 0.717


In [66]:
logloss = log_loss(y_test, y_pred_probs)
balanced_acc = balanced_accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
mcc = matthews_corrcoef(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
weighted_f1 = f1_score(y_test, y_pred, average='weighted')
macro_recall = recall_score(y_test, y_pred, average='macro')  # Sensitivity
macro_precision = precision_score(y_test, y_pred, average='macro')

In [68]:
print("\nSpecificity per Class:")
specificity_per_class = []
for i in range(num_classes):
    tn = np.sum((y_test != i) & (y_pred != i))
    fp = np.sum((y_test != i) & (y_pred == i))
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    specificity_per_class.append(specificity)
    print(f"Class {list(label_dict.keys())[i]} Specificity: {specificity:.3f}")


Specificity per Class:
Class inside Specificity: 0.857
Class outside Specificity: 0.936
Class drink Specificity: 0.972
Class food Specificity: 0.728
Class menu Specificity: 0.999


### ✔️ Specificity per Class
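- Specificity is high for most classes, with 'food' the clear exception:
  - `inside`: 0.857
  - `outside`: 0.936
  - `drink`: 0.972
  - `food`: 0.728
  - `menu`: 0.999
- High specificity means few samples from other classes are assigned to that class. For 'menu' (0.999) this mainly reflects that the model almost never predicts it. 'Food' has the lowest specificity because 233 non-food samples were labelled as food.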
---
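The same specificity values can be derived from the confusion matrix alone, which makes the false-positive counts explicit. This is a sketch using the Round 1 matrix printed earlier; the variable names are illustrative.

```python
import numpy as np

labels = ["inside", "outside", "drink", "food", "menu"]
cm = np.array([
    [317,  65,  26, 110, 0],
    [ 67,  47,   4,  57, 0],
    [ 29,  13,  45,  57, 1],
    [ 90,  29,  17, 874, 0],
    [  7,   1,   2,   9, 0],
])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp      # predicted as the class, but belongs to another
fn = cm.sum(axis=1) - tp      # belongs to the class, but predicted as another
tn = cm.sum() - tp - fp - fn  # everything else
specificity = tn / (tn + fp)

for name, s, n_fp in zip(labels, specificity, fp):
    print(f"{name:8s} specificity={s:.3f} false_positives={n_fp}")
```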

In [70]:
print("\n=== Extended Metrics Summary ===")
print(f"Log Loss: {logloss:.4f}")
print(f"Balanced Accuracy: {balanced_acc:.4f}")
print(f"Cohen's Kappa: {kappa:.4f}")
print(f"Matthews Correlation Coefficient: {mcc:.4f}")
print(f"Macro F1-Score: {macro_f1:.4f}")
print(f"Weighted F1-Score: {weighted_f1:.4f}")
print(f"Macro Recall (Sensitivity): {macro_recall:.4f}")
print(f"Macro Precision: {macro_precision:.4f}")


=== Extended Metrics Summary ===
Log Loss: 1.9876
Balanced Accuracy: 0.4112
Cohen's Kappa: 0.4714
Matthews Correlation Coefficient: 0.4733
Macro F1-Score: 0.4208
Weighted F1-Score: 0.6737
Macro Recall (Sensitivity): 0.4112
Macro Precision: 0.4386


### 📊 Advanced Metrics
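- **Log Loss**: `1.9876` – High: worse than the `ln(5) ≈ 1.609` that uniform guessing over five classes would score, which points to confident predictions on the wrong class.
- **Balanced Accuracy**: `0.4112` – Low: the mean per-class recall is far below the overall accuracy, so performance is uneven across classes.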
- **Cohen’s Kappa**: `0.4714` – Moderate agreement with true labels, better than chance.
- **Matthews Correlation Coefficient**: `0.4733` – Indicates moderate classification quality.
- **Macro F1-Score**: `0.4208` vs. **Weighted F1-Score**: `0.6737` – Shows performance skewed by class imbalance.
- **Macro Recall (Sensitivity)**: `0.4112`, **Macro Precision**: `0.4386` – Both indicate overall weak ability to identify minority classes.
---
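To put the log loss in context, it helps to compare it with two trivial predictors. The sketch below assumes the test-set class counts from the classification report; `mean_log_loss` is a hypothetical helper, and the class-prior baseline is idealised because it uses the test frequencies themselves.

```python
import numpy as np

def mean_log_loss(y_true, probs, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(y_true)), y_true])))

num_classes = 5
support = np.array([518, 175, 145, 1010, 19])  # test-set counts from the report
y_true = np.repeat(np.arange(num_classes), support)

# Baseline 1: the same probability for every class.
uniform = np.full((len(y_true), num_classes), 1.0 / num_classes)
# Baseline 2: always predict the class frequencies.
prior = np.tile(support / support.sum(), (len(y_true), 1))

uniform_loss = mean_log_loss(y_true, uniform)  # equals ln(5)
prior_loss = mean_log_loss(y_true, prior)
print(f"uniform guess: {uniform_loss:.4f}")
print(f"class prior  : {prior_loss:.4f}")
```

A model whose log loss exceeds both baselines, as in Round 1, is paying heavily for confident mistakes even if its accuracy is acceptable.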

## 📌 Summary

- The model shows **strong bias toward the dominant class ('food')**, leading to misleadingly high accuracy and weighted F1-score.
- **The reported 100% "top-3" accuracy is an artifact**: the metric ran with its default `k=5` on a five-class problem, so it says nothing about ranking ability.
- **Severe class imbalance** causes underperformance on minority classes such as 'menu', 'outside', and 'drink'.
- **Overfitting signs** visible: training accuracy increases steadily, but validation accuracy plateaus early.
- **Predictions are overconfident**, as shown by high log loss; calibration may be required.
- **Key improvements needed**:
  - Address class imbalance (resampling, class weighting)
  - Use regularization and/or early stopping
  - Try alternative loss functions (e.g., focal loss)
  - Improve confidence calibration (e.g., temperature scaling)
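
One way to act on the class-weighting suggestion is to weight each class inversely to its frequency. The sketch below is an assumption-laden illustration: `balanced_class_weights` is a hypothetical helper, and the test-set supports stand in for the training label counts, which are not shown in this section.

```python
import numpy as np

def balanced_class_weights(y, num_classes):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = np.bincount(y, minlength=num_classes)
    return {i: len(y) / (num_classes * c) for i, c in enumerate(counts) if c > 0}

# Illustrative labels only: test-set supports used in place of the y_train counts.
y_demo = np.repeat(np.arange(5), [518, 175, 145, 1010, 19])
weights = balanced_class_weights(y_demo, num_classes=5)
for cls, w in weights.items():
    print(cls, round(float(w), 2))
```

A dictionary of this shape can be passed to Keras via `model.fit(..., class_weight=weights)`. With a class as rare as 'menu' the weight becomes very large, so capping it may be worth trying.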

### Round 2 

In [73]:
print("\n=== Round 2: Add Dropout and tune optimizer ===")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=input_shape),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation='softmax')])


=== Round 2: Add Dropout and tune optimizer ===


In [75]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
model.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.SparseTopKCategoricalAccuracy(k=3)])

In [77]:
history = model.fit(X_train, y_train, epochs=15, validation_split=0.2, batch_size=32, verbose=2)

Epoch 1/15
187/187 - 8s - 43ms/step - accuracy: 0.5381 - loss: 1.1687 - sparse_top_k_categorical_accuracy: 0.9089 - val_accuracy: 0.5689 - val_loss: 1.1722 - val_sparse_top_k_categorical_accuracy: 0.9123
Epoch 2/15
187/187 - 6s - 31ms/step - accuracy: 0.5915 - loss: 1.0591 - sparse_top_k_categorical_accuracy: 0.9223 - val_accuracy: 0.6325 - val_loss: 0.9546 - val_sparse_top_k_categorical_accuracy: 0.9190
Epoch 3/15
187/187 - 6s - 32ms/step - accuracy: 0.6325 - loss: 0.9414 - sparse_top_k_categorical_accuracy: 0.9344 - val_accuracy: 0.6539 - val_loss: 0.9050 - val_sparse_top_k_categorical_accuracy: 0.9304
Epoch 4/15
187/187 - 6s - 33ms/step - accuracy: 0.6645 - loss: 0.8738 - sparse_top_k_categorical_accuracy: 0.9396 - val_accuracy: 0.6539 - val_loss: 0.8988 - val_sparse_top_k_categorical_accuracy: 0.9331
Epoch 5/15
187/187 - 6s - 33ms/step - accuracy: 0.6847 - loss: 0.8235 - sparse_top_k_categorical_accuracy: 0.9478 - val_accuracy: 0.6841 - val_loss: 0.8565 - val_sparse_top_k_categoric

In [78]:
test_loss, test_acc, test_top3_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Round 2 Test Accuracy: {test_acc:.4f}")
print(f"Round 2 Test Top-3 Accuracy: {test_top3_acc:.4f}")

Round 2 Test Accuracy: 0.7076
Round 2 Test Top-3 Accuracy: 0.9400


## 🔍 Observations

### 🧪 Model Performance
- **Test Accuracy**: `70.76%` – An improvement over Round 1 (`68.7%`).
- **Top-3 Accuracy**: `94.00%` – The first genuine top-3 figure. It is not comparable with Round 1's 100%, which was computed with the default `k=5`.

In [79]:
y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)

59/59 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step


In [80]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[354  38  30  95   1]
 [ 67  51   7  50   0]
 [ 31  13  50  51   0]
 [ 90  19  35 866   0]
 [ 11   1   4   3   0]]


### 🧾 Confusion Matrix Insights
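- **'Outside'** and **'drink'** are still mostly absorbed into **'inside'** and **'food'**, which are also confused with each other (95 and 90 samples).
- **'Menu'** is never predicted correctly: its high specificity comes from the model almost never predicting it, so sensitivity is zero.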

In [81]:
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=list(label_dict.keys())))

Classification Report:
              precision    recall  f1-score   support

      inside       0.64      0.68      0.66       518
     outside       0.42      0.29      0.34       175
       drink       0.40      0.34      0.37       145
        food       0.81      0.86      0.83      1010
        menu       0.00      0.00      0.00        19

    accuracy                           0.71      1867
   macro avg       0.45      0.44      0.44      1867
weighted avg       0.69      0.71      0.70      1867



### 📉 Class-wise Performance
- **'Food'** remains the strongest class:
  - `Precision: 0.81`, `Recall: 0.86`, `F1-score: 0.83`
- **'Inside'** performs moderately well:
  - `F1-score: 0.66`, showing balanced prediction capability.
- **'Outside'** and **'Drink'** classes still suffer from low F1-scores (`0.34` and `0.37` respectively).
- **'Menu'** continues to have `0.00` for all classification metrics – completely missed by the model.

In [82]:
y_test_cat = to_categorical(y_test, num_classes=num_classes)

In [83]:
auc_scores = []
for i in range(num_classes):
    try:
        auc = roc_auc_score(y_test_cat[:, i], y_pred_probs[:, i])
    except ValueError:  # roc_auc_score raises this if only one class is present
        auc = None
    auc_scores.append(auc)
    print(f"Class {list(label_dict.keys())[i]} AUC: {auc:.3f}" if auc is not None else f"Class {list(label_dict.keys())[i]} AUC: N/A")

Class inside AUC: 0.868
Class outside AUC: 0.801
Class drink AUC: 0.788
Class food AUC: 0.897
Class menu AUC: 0.726


### 📈 AUC per Class
- AUC is reasonably strong for the four larger classes:
  - `Food: 0.897`, `Inside: 0.868`, `Outside: 0.801`, `Drink: 0.788`
  - **Menu remains the lowest** at `0.726`, reinforcing detection difficulty.
---

In [84]:
logloss = log_loss(y_test, y_pred_probs)
balanced_acc = balanced_accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
mcc = matthews_corrcoef(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
weighted_f1 = f1_score(y_test, y_pred, average='weighted')
macro_recall = recall_score(y_test, y_pred, average='macro')
macro_precision = precision_score(y_test, y_pred, average='macro')

In [85]:
specificity_per_class = []
for i in range(num_classes):
    tn = np.sum((y_test != i) & (y_pred != i))
    fp = np.sum((y_test != i) & (y_pred == i))
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    specificity_per_class.append(specificity)
    print(f"Class {list(label_dict.keys())[i]} Specificity: {specificity:.3f}")

Class inside Specificity: 0.852
Class outside Specificity: 0.958
Class drink Specificity: 0.956
Class food Specificity: 0.768
Class menu Specificity: 0.999


### ✔️ Specificity per Class
- Specificity remains high for most classes, with 'food' again the lowest:
  - `Inside: 0.852`, `Outside: 0.958`, `Drink: 0.956`, `Food: 0.768`, `Menu: 0.999`
- High specificity but zero recall for 'menu' means the model **almost never predicts it**, even when it is the true label.

---

In [86]:
print(f"Log Loss: {logloss:.4f}")
print(f"Balanced Accuracy: {balanced_acc:.4f}")
print(f"Cohen's Kappa: {kappa:.4f}")
print(f"Matthews Correlation Coefficient: {mcc:.4f}")
print(f"Macro F1-Score: {macro_f1:.4f}")
print(f"Weighted F1-Score: {weighted_f1:.4f}")
print(f"Macro Recall (Sensitivity): {macro_recall:.4f}")
print(f"Macro Precision: {macro_precision:.4f}")

Log Loss: 1.1320
Balanced Accuracy: 0.4354
Cohen's Kappa: 0.5108
Matthews Correlation Coefficient: 0.5120
Macro F1-Score: 0.4416
Weighted F1-Score: 0.6958
Macro Recall (Sensitivity): 0.4354
Macro Precision: 0.4536


### 📊 Advanced Metrics
- **Log Loss**: `1.1320` – Improved compared to Round 1 (`1.9876`), suggesting less overconfident probability estimates.
- **Balanced Accuracy**: `0.4354` – Slight improvement over Round 1 (`0.4112`).
- **Cohen’s Kappa**: `0.5108`, **MCC**: `0.5120` – Moderate agreement, better than previous round.
- **Macro F1-score**: `0.4416`, **Weighted F1-score**: `0.6958` – Improvement in balanced performance.
- **Macro Recall**: `0.4354`, **Macro Precision**: `0.4536` – Better ability to recognize minority classes, though still limited.
---

## 📌 Summary – Round 2

- **Test accuracy increased** slightly, and log loss decreased substantially (`1.9876` to `1.1320`), consistent with less overconfident predictions.
- **Model still favors majority class ('food')**, but shows **incremental improvements** in minority class handling.
- **Menu class is still completely missed**, which is unsurprising with only 19 test samples and, presumably, similarly few training examples.
- **Confusion remains** among 'inside', 'outside', and 'drink'.
- **Advanced metrics (Kappa, MCC)** show consistent moderate agreement, better than Round 1.
- Improvements likely stem from the changes made in this round: dropout, a wider dense layer (128 units), a lower learning rate (0.0005), and longer training (15 epochs). No data augmentation was used yet.

### ✅ Recommendation
- Further improvement may require:
  - **Class balancing techniques** (oversampling/undersampling, SMOTE)
  - **Focal Loss** or **class-weighted loss functions**
  - **Data augmentation** for underrepresented classes
  - **Model ensembling** or architecture tuning
  - **Error analysis** on 'menu' class to ensure proper data quality and representation

### Round 3

In [98]:
print("\n=== Round 3: Add BatchNorm and Data Augmentation ===")
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1
)


=== Round 3: Add BatchNorm and Data Augmentation ===


In [104]:
# Build model with BatchNormalization
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=input_shape),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    BatchNormalization(),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    BatchNormalization(),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    BatchNormalization(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    BatchNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

In [106]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0003)
model.compile(optimizer=optimizer,loss='sparse_categorical_crossentropy',metrics=['accuracy',tf.keras.metrics.SparseTopKCategoricalAccuracy(k=3)])

In [108]:
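# NOTE: validation_data is the test set here, so the val_* metrics are test metrics;
# avoid using them for tuning decisions.
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),
    epochs=20,
    validation_data=(X_test, y_test),
    verbose=2)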

Epoch 1/20
234/234 - 18s - 78ms/step - accuracy: 0.4111 - loss: 1.8600 - sparse_top_k_categorical_accuracy: 0.7500 - val_accuracy: 0.2512 - val_loss: 1.6676 - val_sparse_top_k_categorical_accuracy: 0.7167
Epoch 2/20
234/234 - 14s - 60ms/step - accuracy: 0.5669 - loss: 1.2985 - sparse_top_k_categorical_accuracy: 0.8493 - val_accuracy: 0.5142 - val_loss: 1.4770 - val_sparse_top_k_categorical_accuracy: 0.8056
Epoch 3/20
234/234 - 15s - 63ms/step - accuracy: 0.6430 - loss: 1.0892 - sparse_top_k_categorical_accuracy: 0.8910 - val_accuracy: 0.6540 - val_loss: 1.0899 - val_sparse_top_k_categorical_accuracy: 0.8848
Epoch 4/20
234/234 - 14s - 62ms/step - accuracy: 0.6697 - loss: 0.9860 - sparse_top_k_categorical_accuracy: 0.9112 - val_accuracy: 0.6079 - val_loss: 1.2801 - val_sparse_top_k_categorical_accuracy: 0.8243
Epoch 5/20
234/234 - 14s - 60ms/step - accuracy: 0.6952 - loss: 0.9012 - sparse_top_k_categorical_accuracy: 0.9274 - val_accuracy: 0.7450 - val_loss: 0.7949 - val_sparse_top_k_categorical_acc

In [109]:
test_loss, test_acc, test_top3_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Round 3 Test Accuracy: {test_acc:.4f}")
print(f"Round 3 Test Top-3 Accuracy: {test_top3_acc:.4f}")

Round 3 Test Accuracy: 0.7574
Round 3 Test Top-3 Accuracy: 0.9529


---

## 🔍 Observations

### 🧪 Model Performance
- **Test Accuracy**: `75.74%` – Strong improvement over Round 2 (`70.76%`) and Round 1 (`68.7%`).
- **Top-3 Accuracy**: `95.29%` – Remains high and reliable.

In [110]:
y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)

59/59 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step


In [111]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[478  18   4  17   1]
 [105  54   0  16   0]
 [ 66  11  36  30   2]
 [145  15   5 845   0]
 [ 16   0   0   2   1]]


### 🧾 Confusion Matrix Insights
- **High recall for 'inside'** – only 40 misclassified out of 518, but 332 samples from other classes are now predicted as 'inside'.
- **'Outside' and 'Drink'** classes often confused with 'inside' and 'food'.
- **'Menu'** still largely misclassified but slightly better than before – 1 of 19 samples is now predicted correctly.

---

In [112]:
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=list(label_dict.keys())))

Classification Report:
              precision    recall  f1-score   support

      inside       0.59      0.92      0.72       518
     outside       0.55      0.31      0.40       175
       drink       0.80      0.25      0.38       145
        food       0.93      0.84      0.88      1010
        menu       0.25      0.05      0.09        19

    accuracy                           0.76      1867
   macro avg       0.62      0.47      0.49      1867
weighted avg       0.78      0.76      0.74      1867



### 📉 Class-wise Performance
- **'Food'** class continues to dominate:
  - `Precision: 0.93`, `Recall: 0.84`, `F1-score: 0.88`
- **'Inside'** shows major improvement:
  - `Recall: 0.92`, `F1-score: 0.72` – much better than in previous rounds.
- **'Outside'** and **'Drink'** still underperform on recall (`0.31` and `0.25`), with low F1-scores (`0.40` and `0.38`).
- **'Menu'** now has non-zero scores:
  - `Precision: 0.25`, `Recall: 0.05`, `F1-score: 0.09` – slight progress from 0.00 in previous rounds, based on a single correct prediction.

---

In [113]:
y_test_cat = to_categorical(y_test, num_classes=num_classes)

In [114]:
auc_scores = []
for i in range(num_classes):
    try:
        auc = roc_auc_score(y_test_cat[:, i], y_pred_probs[:, i])
    except ValueError:  # roc_auc_score raises this if only one class is present
        auc = None
    auc_scores.append(auc)
    print(f"Class {list(label_dict.keys())[i]} AUC: {auc:.3f}" if auc is not None else f"Class {list(label_dict.keys())[i]} AUC: N/A")

Class inside AUC: 0.925
Class outside AUC: 0.869
Class drink AUC: 0.870
Class food AUC: 0.956
Class menu AUC: 0.879


### 📈 AUC per Class
- All classes show strong AUC:
  - `Food: 0.956`, `Inside: 0.925`, `Drink: 0.870`, `Outside: 0.869`, `Menu: 0.879`
- Notable jump in **'menu'** AUC from Round 2 (`0.726` to `0.879`) – indicates improved ranking even if raw classification remains weak.

---
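A class can have a high AUC and still never be predicted, because AUC only measures how well positives are ranked above negatives. The toy example below illustrates this; the data and the `binary_auc` helper are assumptions, with column 1 standing in for a rare class such as 'menu'.

```python
import numpy as np

def binary_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Six common-class samples followed by two rare-class samples.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
probs = np.array([
    [0.95, 0.05], [0.90, 0.10], [0.92, 0.08], [0.97, 0.03],
    [0.88, 0.12], [0.93, 0.07], [0.70, 0.30], [0.60, 0.40],
])

y_pred = probs.argmax(axis=1)                          # the rare class never wins
recall_rare = float(np.mean(y_pred[y_true == 1] == 1))
auc_rare = binary_auc(y_true, probs[:, 1])
print(recall_rare, auc_rare)  # prints 0.0 1.0
```

In this situation a lower decision threshold for the rare class, rather than plain `argmax`, could recover some of its recall.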

In [115]:
logloss = log_loss(y_test, y_pred_probs)
balanced_acc = balanced_accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
mcc = matthews_corrcoef(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
weighted_f1 = f1_score(y_test, y_pred, average='weighted')
macro_recall = recall_score(y_test, y_pred, average='macro')
macro_precision = precision_score(y_test, y_pred, average='macro')

In [116]:
specificity_per_class = []

for i in range(num_classes):
    tn = np.sum((y_test != i) & (y_pred != i))
    fp = np.sum((y_test != i) & (y_pred == i))
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    specificity_per_class.append(specificity)
    print(f"Class {list(label_dict.keys())[i]} Specificity: {specificity:.3f}")

Class inside Specificity: 0.754
Class outside Specificity: 0.974
Class drink Specificity: 0.995
Class food Specificity: 0.924
Class menu Specificity: 0.998


### ✔️ Specificity per Class
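- Specificity is high for every class except 'inside', which dropped from 0.852 in Round 2:
  - `Inside: 0.754`, `Outside: 0.974`, `Drink: 0.995`, `Food: 0.924`, `Menu: 0.998`
- **Very high specificity for 'drink' and 'menu'**, suggesting the model is still cautious about predicting them, while the lower value for 'inside' shows it now over-predicts that class.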

---

In [117]:
print(f"Log Loss: {logloss:.4f}")
print(f"Balanced Accuracy: {balanced_acc:.4f}")
print(f"Cohen's Kappa: {kappa:.4f}")
print(f"Matthews Correlation Coefficient: {mcc:.4f}")
print(f"Macro F1-Score: {macro_f1:.4f}")
print(f"Weighted F1-Score: {weighted_f1:.4f}")
print(f"Macro Recall (Sensitivity): {macro_recall:.4f}")
print(f"Macro Precision: {macro_precision:.4f}")

Log Loss: 0.6911
Balanced Accuracy: 0.4738
Cohen's Kappa: 0.6017
Matthews Correlation Coefficient: 0.6183
Macro F1-Score: 0.4923
Weighted F1-Score: 0.7433
Macro Recall (Sensitivity): 0.4738
Macro Precision: 0.6239


### 📊 Advanced Metrics
- **Log Loss**: `0.6911` – Significant improvement from Round 2 (`1.1320`) and Round 1 (`1.9876`), suggesting better-calibrated confidence.
- **Balanced Accuracy**: `0.4738` – Best so far across all rounds.
- **Cohen’s Kappa**: `0.6017`, **MCC**: `0.6183` – Moderate-to-substantial agreement, the highest of the three rounds.
- **Macro F1-score**: `0.4923`, **Weighted F1-score**: `0.7433` – Clear improvement, especially in class balance.
- **Macro Recall**: `0.4738`, **Macro Precision**: `0.6239` – Highest recall and precision among all rounds.
---

## 📌 Summary – Round 3

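- **Best round so far** in nearly all metrics: accuracy, F1-score, log loss, Kappa, and MCC.
- **Caveat on comparability**: this round trained on the full training set (234 steps per epoch vs. 187) and used the test set as validation data, so the comparison with Rounds 1 and 2 is not like-for-like.
- **'Inside' recall improved sharply** (`0.68` to `0.92`), but precision fell (`0.64` to `0.59`): the model now over-predicts 'inside', absorbing many 'outside', 'drink', and 'food' samples.
- **'Food' class remains the most confidently and accurately predicted.**
- **'Menu' class still underperforms**, but is finally detected with non-zero recall and a much-improved AUC.
- **Class imbalance persists**: macro recall (`0.47`) still trails the weighted metrics, and 'drink' recall fell to `0.25`.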
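
Because the Round 3 `fit` call passes `(X_test, y_test)` as `validation_data`, a separate validation split would keep the test set untouched. The following is a minimal sketch of a stratified hold-out in NumPy; `stratified_holdout` is a hypothetical helper and the demo labels are stand-ins for `y_train`.

```python
import numpy as np

def stratified_holdout(y, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with roughly val_fraction of each class held out."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_val = max(1, int(round(val_fraction * len(idx))))  # at least one per class
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

# Stand-in labels; in the notebook this would be y_train.
y_demo = np.repeat(np.arange(5), [518, 175, 145, 1010, 19])
train_idx, val_idx = stratified_holdout(y_demo)
print(len(train_idx), len(val_idx))
```

The indices could then feed `datagen.flow(X_train[train_idx], y_train[train_idx], batch_size=32)` and `validation_data=(X_train[val_idx], y_train[val_idx])`, so that augmentation applies only to the training portion.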

### ✅ Recommendation
- Maintain this current training setup and fine-tune further for minority classes:
  - Use targeted **data augmentation** for 'menu', 'outside', 'drink'
  - Experiment with **cost-sensitive loss** or **oversampling** the minority classes
  - Consider ensembling or **multi-head architectures** to balance precision/recall trade-offs
  - Use **calibration tools** like Platt scaling or isotonic regression for further log loss improvements
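
Temperature scaling, mentioned in the Round 1 summary, is a simpler calibration option than Platt scaling or isotonic regression. The sketch below uses synthetic data and hypothetical helper names: it grid-searches a single temperature `T` on held-out predictions and rescales the log-probabilities by `1 / T`. It should be fitted on a validation split rather than on the test set.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(y_true, probs, eps=1e-12):
    picked = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1.0)
    return float(-np.mean(np.log(picked)))

def fit_temperature(y_val, probs_val, grid=np.linspace(0.5, 5.0, 46)):
    """Grid-search the temperature T that minimises NLL of softmax(log(p) / T)."""
    logits = np.log(np.clip(probs_val, 1e-12, 1.0))
    losses = [nll(y_val, softmax(logits / t)) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic overconfident model: labels follow softmax(z), predictions use softmax(3z).
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(2000, 5))
y_val = np.array([rng.choice(5, p=p) for p in softmax(true_logits)])
overconfident = softmax(3.0 * true_logits)

T = fit_temperature(y_val, overconfident)
calibrated = softmax(np.log(overconfident) / T)
print(T, nll(y_val, overconfident), nll(y_val, calibrated))
```

Dividing by `T > 1` softens the probabilities without changing the `argmax`, so accuracy stays the same while log loss can drop.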