# **Background and objective**
https://www.kaggle.com/datasets/denizkavi1/brain-tumor

Summary by the author of a paper on this dataset

*This brain tumor dataset contains 3064 T1-weighted contrast-enhanced images from 233 patients with three kinds of brain tumor: meningioma (708 slices), glioma (1426 slices), and pituitary tumor (930 slices).
This data was used in the following paper:*

We create a DNN that will distinguish and predict these 3 classes (kinds of tumors). A principled loss function based on the medical costs of each error will be created and used. A principled performance metric on the same basis will be created and maximized via a hyperparameter grid search.

# **Problem in general form**

A DNN with $L$ layers between $y$ and the image data matrix $X$ can be represented in the generic form

$$y=(f_{L} \circ f_{L-1} \circ \cdots \circ f_{1})(X) \in \{\text{Meningioma}, \text{Glioma}, \text{Pituitary}\}$$

$$f_{k}: \mathbb{R}^{d_{k}} \to \mathbb{R}^{d_{k+1}}$$

where $d_{k}$ is the dimension (width) of the $k^{th}$ layer and $\textbf{d} \in \mathbb{N}^{L}$ is the vector of layer widths. Each $f_{k}$ maps layer $k$ to layer $k+1$ and is characterized by its activation function.
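
To make the composition concrete, here is a minimal sketch in plain Python. The toy weights, the 4-dimensional input, and the helper names `compose` and `dense` are hypothetical and chosen only for illustration; the actual model is built with Keras further down.

```python
from functools import reduce
import math

def compose(*fs):
    """Return the map x -> f_L(...f_1(x)) for fs = (f_1, ..., f_L)."""
    return lambda x: reduce(lambda acc, f: f(acc), fs, x)

def dense(weights, biases, activation):
    """Build one layer f_k from a weight matrix (list of rows), biases and an activation."""
    def layer(x):
        z = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        return activation(z)
    return layer

def relu(z):
    return [max(0.0, v) for v in z]

def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

CLASSES = ["Meningioma", "Glioma", "Pituitary"]

# Hypothetical toy weights: 4 inputs -> 2 hidden units -> 3 class probabilities
f1 = dense([[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]], [0.0, 0.0], relu)
f2 = dense([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0], softmax)
network = compose(f1, f2)

x = [2.0, 0.5, 0.0, 0.0]                        # stands in for a flattened image X
probs = network(x)
prediction = CLASSES[probs.index(max(probs))]   # class with the highest probability
print(prediction, probs)
```

The predicted class is simply the index of the largest output probability, which is how the set-valued $y$ in the formula is obtained from the final layer.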

We also have a hyperparameter vector $H$ for training the DNN, defined as

$$H=(r,B,L) \in \mathbb{N}^{3}$$

where

*   $r$ is the side length of the square image resolution (resolution = $r \times r$)
*   $B$ is the training batch size
*   $L$ is the number of layers between $X$ and $y$. Our $f_{L}$ would be the activation that produces the final output



# **Specifying the functions and hyperparameters**



1.   **Choosing what $f_{k}$ and $\textbf{d}$ to use**
      
      For $k \neq L$, the $f_{k}$ are the activations between hidden layers; they do not produce the final output. For all $k \neq L$, we choose $f_{k}=\text{ReLU}$.

      To design our neural architecture $\textbf{d}$, we must consider that the visual complexity of the tumor itself isn't high, but its position varies considerably from image to image. We therefore make $\textbf{d}$ wide for the first few layers and then narrow it abruptly. The idea is to scan widely for a simple object, then zero in aggressively once we get a signal.
      
      Let's try
      *   $\textbf{d}=(256,64)$ if $L=2$
      *   $\textbf{d}=(256,256,16)$ if $L=3$
      *   $\textbf{d}=(256,256,32,16)$ if $L=4$

  The three different choices of $L$ here are justified in point 2 below.

 2.   **Choosing a search space $\mathcal{H}$ for $H$ and why**

      Let
$$r \in \{128,256\},$$
$$B \in \{32,64\},$$
$$L \in \{2,3,4\}$$

We do not need a very high resolution: the tumor is visually distinct and does not blend into the background.


We use modest batch sizes of 32 or 64 because the images are low resolution. That is, with lower-resolution data we can afford finer-grained learning (smaller batches, hence more gradient updates per epoch).

We use between 2 and 4 hidden layers because the images are low resolution and the task is limited to detecting tumors. This balances model capacity against computational cost.
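
To put a rough number on the computational cost, the sketch below counts the trainable parameters of each candidate architecture when every layer is fully connected. It assumes 3 input channels (the Keras image loader defaults to RGB) and a 3-unit output layer; `dense_param_count` is a hypothetical helper, not part of the notebook.

```python
def dense_param_count(r, widths, channels=3, n_classes=3):
    """Count weights and biases of Flatten -> Dense(widths...) -> Dense(n_classes)."""
    sizes = [r * r * channels] + list(widths) + [n_classes]
    # Each dense layer has n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Layer widths d for each L, as listed in the text
ARCHITECTURES = {2: (256, 64), 3: (256, 256, 16), 4: (256, 256, 32, 16)}

for r in (128, 256):
    for L, widths in ARCHITECTURES.items():
        print(f"r={r}, L={L}: {dense_param_count(r, widths):,} parameters")
```

Under these assumptions the first layer dominates: doubling $r$ roughly quadruples the parameter count, while changing $L$ barely moves it.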







    






# **Choosing an informed loss function $\mathcal{L}(\hat{Y}, Y)$**

Let $Y \in \{0,1\}^{n \times 3}$ be the label matrix, where each row $Y_{i}$ is a one-hot vector: if its $k^{th}$ entry is 1, sample $i$ belongs to class $k$. Let $\hat{Y}_{i}$ be the vector of predicted class probabilities for sample $i$.

Let Classes 0, 1, 2 be Meningioma, Glioma, and Pituitary respectively. We will create a weighted cross-entropy loss where each weight is the mortality risk (in percent) divided by 10. We take the death rates of meningioma, glioma and pituitary tumors to be 5%, 80%, and 1% respectively, so our weight vector becomes $\textbf{w}=[0.5,8,0.1]$. Our cost function is then

$$\mathcal{L}(\hat{Y},Y)=\frac{-1}{n} \sum_{i=1}^{n} \sum_{j=0}^{2}w_{j}Y_{ij}\log(\hat{Y}_{ij})$$
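
As a worked example of the formula, the following plain-Python sketch evaluates the loss on two hypothetical single-sample batches. The predicted probabilities are made up for illustration; the weights are those given in the text.

```python
import math

W = [0.5, 8.0, 0.1]  # class weights for meningioma, glioma, pituitary (from the text)

def weighted_cross_entropy(y_true, y_pred, weights=W, eps=1e-7):
    """Mean over samples of -sum_j w_j * Y_ij * log(Yhat_ij)."""
    total = 0.0
    for truth, probs in zip(y_true, y_pred):
        for w, t, p in zip(weights, truth, probs):
            p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
            total -= w * t * math.log(p)
    return total / len(y_true)

# Hypothetical predictions: the true class gets probability 0.6 in both cases
glioma_case = weighted_cross_entropy([[0, 1, 0]], [[0.2, 0.6, 0.2]])
pituitary_case = weighted_cross_entropy([[0, 0, 1]], [[0.2, 0.2, 0.6]])

print(f"glioma sample:    {glioma_case:.4f}")
print(f"pituitary sample: {pituitary_case:.4f}")
```

With identical confidence in the true class, the glioma sample contributes 80 times the loss of the pituitary sample, which is exactly the ratio of the two weights.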



# **Choosing an informed performance metric with respect to $H$, $\mathcal{M}(H)$**

This will be designed using the same principles as our loss function $\mathcal{L}(\hat{Y},Y)$. It will be a normalized weighted combination of the per-class F scores, with the same weights as the loss function.

$$\mathcal{M}(H)=\frac{8 \cdot F1_{glioma}+0.1 \cdot F1_{pituitary} + 0.5 \cdot F1_{meningioma}}{8.6}$$

We divide by 8.6 because that is the maximum value of the numerator, so $\mathcal{M}(H) \in [0,1]$. Each score is an $F_{\beta}$ score with its own $\beta$. Considering the aggressiveness and lethality of glioma, we strongly emphasize recall and set $\beta_{glioma}=3$, which weights recall $\beta^{2}=9$ times as heavily as precision in the $F_{\beta}$ formula. We set $\beta_{meningioma}=\beta_{pituitary}=1$, which gives the ordinary F1 score.
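
The effect of $\beta$ is easiest to see on numbers. The sketch below computes $F_{\beta}$ from confusion counts and then $\mathcal{M}$; the counts are hypothetical, and `f_beta` and `metric_M` are illustrative helpers rather than the notebook's TensorFlow implementation.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def metric_M(f_glioma, f_pituitary, f_meningioma):
    """Normalized weighted combination of the per-class scores."""
    return (8.0 * f_glioma + 0.1 * f_pituitary + 0.5 * f_meningioma) / 8.6

# Hypothetical glioma counts: precision 0.9, recall 0.6
low_recall_f1 = f_beta(tp=90, fp=10, fn=60, beta=1.0)
low_recall_f3 = f_beta(tp=90, fp=10, fn=60, beta=3.0)
# Same counts with the errors swapped: precision 0.6, recall 0.9
high_recall_f3 = f_beta(tp=90, fp=60, fn=10, beta=3.0)

print(f"F1 (P=0.9, R=0.6): {low_recall_f1:.3f}")
print(f"F3 (P=0.9, R=0.6): {low_recall_f3:.3f}")
print(f"F3 (P=0.6, R=0.9): {high_recall_f3:.3f}")
print(f"M with perfect glioma score only: {metric_M(1.0, 0.0, 0.0):.3f}")
```

F1 cannot tell the two error profiles apart, while $F_{3}$ clearly prefers the high-recall one. The last line also shows how strongly the glioma term dominates $\mathcal{M}$.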




# **Technical implementations**

Load the data

In [None]:
!unzip /content/TumorImages.zip -d /content/dataset


In [62]:

dataset_path = "/content/dataset/TumorImages"


In [None]:
import contextlib
import io

import tensorflow as tf

def prepare_data(dataset_path, B, r):
  """Build training and validation datasets of r x r images in batches of B."""
  img_size = (r, r)

  # Suppress the file-count messages Keras prints while scanning the directory
  with contextlib.redirect_stdout(io.StringIO()):
    train_ds = tf.keras.utils.image_dataset_from_directory(
        dataset_path,
        labels='inferred',          # class labels come from the sub-folder names
        label_mode='categorical',   # one-hot labels, as the custom loss expects
        batch_size=B,
        image_size=img_size,
        shuffle=True,
        validation_split=0.2,
        subset='training',
        seed=123                    # same seed in both calls keeps the split consistent
    )

    val_ds = tf.keras.utils.image_dataset_from_directory(
        dataset_path,
        labels='inferred',
        label_mode='categorical',
        batch_size=B,
        image_size=img_size,
        shuffle=True,
        validation_split=0.2,
        subset='validation',
        seed=123
    )

  # Optional: prefetch for performance
  AUTOTUNE = tf.data.AUTOTUNE
  train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)
  val_ds = val_ds.prefetch(buffer_size=AUTOTUNE)

  return train_ds, val_ds


In [None]:
def get_model(L):
    """Build a fully connected classifier with L hidden layers and a 3-class softmax output."""
    if L == 2:
        hidden_layers = [256, 64]
    elif L == 3:
        hidden_layers = [256, 256, 16]
    elif L == 4:
        hidden_layers = [256, 256, 32, 16]
    else:
        raise ValueError("Unsupported L value. Choose 2, 3, or 4.")

    layers = [tf.keras.layers.Flatten()]
    for units in hidden_layers:
        layers.append(tf.keras.layers.Dense(units, activation='relu'))
    layers.append(tf.keras.layers.Dense(3, activation='softmax'))

    model = tf.keras.Sequential(layers)
    return model



In [None]:

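def custom_weighted_loss(y_true, y_pred):
    """Cross-entropy in which each sample is weighted by the weight of its true class."""
    class_weights = tf.constant([0.5, 8.0, 0.1])  # meningioma, glioma, pituitary
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)  # avoid log(0)

    # Per-sample cross-entropy and per-sample weight (y_true is one-hot)
    ce = -tf.reduce_sum(y_true * tf.math.log(y_pred), axis=1)
    weights = tf.reduce_sum(y_true * class_weights, axis=1)
    weighted_ce = weights * ce

    return tf.reduce_mean(weighted_ce)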


In [None]:
def custom_metric(beta_glioma):
    """Return a Keras metric function computing M with the given beta for glioma."""
    # Keras evaluates metric_fn on each batch and reports the running mean.
    def metric_fn(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred_labels = tf.argmax(y_pred, axis=1)
        y_true_labels = tf.argmax(y_true, axis=1)

        def f1_beta(class_id, beta):
            # Counts for the one-vs-rest problem of this class
            true_pos = tf.reduce_sum(tf.cast((y_pred_labels == class_id) & (y_true_labels == class_id), tf.float32))
            pred_pos = tf.reduce_sum(tf.cast((y_pred_labels == class_id), tf.float32))
            actual_pos = tf.reduce_sum(tf.cast((y_true_labels == class_id), tf.float32))

            # The 1e-7 terms guard against division by zero
            precision = true_pos / (pred_pos + 1e-7)
            recall = true_pos / (actual_pos + 1e-7)

            beta_sq = beta ** 2
            return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + 1e-7)

        f1_meningioma = f1_beta(0, 1.0)
        f1_glioma     = f1_beta(1, beta_glioma)
        f1_pituitary  = f1_beta(2, 1.0)

        # Same class weights as the loss, normalized so the result lies in [0, 1]
        return (8.0 * f1_glioma + 0.1 * f1_pituitary + 0.5 * f1_meningioma)/8.6

    return metric_fn


In [None]:
from tensorflow.keras.callbacks import EarlyStopping

def train_model(model, train_ds, val_ds, E):
  """Train for up to E epochs and return the final-epoch value of the custom metric."""
  model.compile(
      optimizer='adam',
      loss=custom_weighted_loss,
      metrics=[custom_metric(beta_glioma=3.0)]
  )

  early_stop = EarlyStopping(
      monitor='metric_fn',   # training-set value of the custom metric (named after the inner function)
      min_delta=0.005,       # minimum improvement that counts as progress
      patience=5,            # epochs to wait without improvement before stopping
      mode='max',            # higher metric values are better
      restore_best_weights=True
  )

  history = model.fit(
      train_ds,
      validation_data=val_ds,
      epochs=E,
      callbacks=[early_stop],
      verbose=0
  )
  # Value from the last epoch run, measured on the training data;
  # the validation counterpart is stored under 'val_metric_fn'.
  M = history.history['metric_fn'][-1]
  return M


In [None]:


import itertools

batch_sizes = [32, 64]
image_sizes = [128, 256]
layer_counts = [2, 3, 4]
E = 50  # maximum number of epochs per configuration

# Every (B, r, L) combination: 2 * 2 * 3 = 12 configurations
combinations = list(itertools.product(batch_sizes, image_sizes, layer_counts))

for B, r, L in combinations:
    print(f"Running config: r={r}, B={B}, L={L}, epochs={E}")
    train_ds, val_ds = prepare_data(dataset_path, B, r)
    model = get_model(L)
    metric = train_model(model, train_ds, val_ds, E)
    print(f"Metric (M): {metric}")

Running config: r=128, B=32, L=2, epochs=50
Metric (M): 0.8292699456214905
Running config: r=128, B=32, L=3, epochs=50
Metric (M): 0.005325795151293278
Running config: r=128, B=32, L=4, epochs=50
Metric (M): 0.8292699456214905
Running config: r=256, B=32, L=2, epochs=50
Metric (M): 0.8292699456214905
Running config: r=256, B=32, L=3, epochs=50
Metric (M): 0.005325795151293278
Running config: r=256, B=32, L=4, epochs=50
Metric (M): 0.02126740850508213
Running config: r=128, B=64, L=2, epochs=50
Metric (M): 0.834864616394043
Running config: r=128, B=64, L=3, epochs=50
Metric (M): 0.8326484560966492
Running config: r=128, B=64, L=4, epochs=50
Metric (M): 0.021434083580970764
Running config: r=256, B=64, L=2, epochs=50
Metric (M): 0.005366576835513115
Running config: r=256, B=64, L=3, epochs=50
Metric (M): 0.021434083580970764
Running config: r=256, B=64, L=4, epochs=50
Metric (M): 0.005366576835513115


The configuration $H=(r,B,L)$ that maximizes $\mathcal{M}(H)$ among the 12 configurations tried is $H=(128,64,2)$ with $\mathcal{M}(H) \approx 0.835$.
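
The selection can be reproduced from the printed output with a few lines of Python. The values are copied from the grid-search log above; the 0.1 threshold used to flag near-zero runs is an arbitrary assumption for illustration.

```python
# (r, B, L) -> M, copied from the printed grid-search output
results = {
    (128, 32, 2): 0.8292699456214905,
    (128, 32, 3): 0.005325795151293278,
    (128, 32, 4): 0.8292699456214905,
    (256, 32, 2): 0.8292699456214905,
    (256, 32, 3): 0.005325795151293278,
    (256, 32, 4): 0.02126740850508213,
    (128, 64, 2): 0.834864616394043,
    (128, 64, 3): 0.8326484560966492,
    (128, 64, 4): 0.021434083580970764,
    (256, 64, 2): 0.005366576835513115,
    (256, 64, 3): 0.021434083580970764,
    (256, 64, 4): 0.005366576835513115,
}

# Configuration with the largest M
best_H = max(results, key=results.get)
best_M = results[best_H]

# Runs whose metric stayed near zero (threshold is an assumption)
collapsed = sorted(H for H, M in results.items() if M < 0.1)

print(f"best H = {best_H}, M = {best_M:.3f}")
print(f"{len(collapsed)} of {len(results)} runs ended with M < 0.1: {collapsed}")
```

Listing the near-zero runs alongside the winner makes it easier to see which parts of the search space trained successfully at all.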