# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [1]:
from tensorflow.keras.applications import ResNet152V2
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras import preprocessing
import numpy as np

BATCHSIZE = 32
EPOCHS = 1




## Model Choice

At its core, my idea is a simple image classification problem, so I chose an established, pretrained model as the baseline: ResNet152V2.

For the baseline I use the ResNet model without its top layer and add a new output layer with 4 nodes, one per class. The pretrained layers are frozen, so the single epoch of training only fits the weights of this new output layer.
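
To make the size of the trainable part concrete, the parameter count of the new output layer can be worked out by hand. This is a small sketch that assumes the last feature map of ResNet152V2 has 2048 channels, so that global average pooling hands a 2048-element vector to the 4-node dense layer.

```python
# Assumption: ResNet152V2's final feature map has 2048 channels, so global
# average pooling produces a 2048-element vector per image.
FEATURE_CHANNELS = 2048
NUM_CLASSES = 4  # Eosinophil, Lymphocyte, Monocyte, Neutrophil

weights = FEATURE_CHANNELS * NUM_CLASSES  # one weight per (feature, class) pair
biases = NUM_CLASSES                      # one bias per output node
trainable_params = weights + biases

print(trainable_params)  # 8196
```

Everything else in the network stays at its pretrained ImageNet values when the base layers are frozen.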


## Feature Selection

The dataset I am using offers only 4 categories:
- Eosinophil
- Lymphocyte
- Monocyte
- Neutrophil

I will use all 4 categories.
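
Since all four categories are used, it can be worth checking how many images each one contributes. The sketch below assumes the directory layout that `image_dataset_from_directory` expects (one subdirectory per class); `count_images_per_class` and `IMAGE_SUFFIXES` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# File extensions treated as images in this sketch (an assumption, adjust as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def count_images_per_class(root):
    """Return {class_name: number_of_image_files} for each subdirectory of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.rglob("*")
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

For this notebook the call would be `count_images_per_class('../1_DatasetCharacteristics/train/')`.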


In [2]:
# use keras.preprocessing.image_dataset_from_directory to load the images and
# split them 80/20 into training and test sets; both calls share the same seed
# so that the two subsets do not overlap
train_data = preprocessing.image_dataset_from_directory(
    '../1_DatasetCharacteristics/train/',
    validation_split=0.2,
    subset='training',
    seed=123,
    batch_size=BATCHSIZE,
)

test_data = preprocessing.image_dataset_from_directory(
    '../1_DatasetCharacteristics/train/',
    validation_split=0.2,
    subset='validation',
    seed=123,
    batch_size=BATCHSIZE,
)


Found 9957 files belonging to 4 classes.
Using 7966 files for training.



Found 9957 files belonging to 4 classes.
Using 1991 files for validation.
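
The file counts above, and the step counts that appear in the later training and prediction logs, follow from simple arithmetic. The sketch below assumes that the validation count is the split fraction truncated to an integer and that the last partial batch is kept.

```python
import math

TOTAL_FILES = 9957       # from the "Found 9957 files" output
VALIDATION_SPLIT = 0.2   # as passed to image_dataset_from_directory
BATCH_SIZE = 32          # BATCHSIZE in this notebook

# Assumption: validation count = fraction of the total, truncated to an integer.
n_val = int(VALIDATION_SPLIT * TOTAL_FILES)
n_train = TOTAL_FILES - n_val

# Assumption: the final, smaller batch is kept, so the step count rounds up.
train_steps = math.ceil(n_train / BATCH_SIZE)
val_steps = math.ceil(n_val / BATCH_SIZE)

print(n_train, n_val)          # 7966 1991
print(train_steps, val_steps)  # 249 63
```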


## Implementation

Implementation of the base ResNet152V2 model, with minimal changes for the specified task.

In [3]:
# load the ResNet152V2 model
base_model = ResNet152V2(weights='imagenet', include_top=False)

# add new top layers
x = base_model.output
x = GlobalAveragePooling2D()(x)
predictions = Dense(4, activation='softmax')(x)

# create the new model
model = Model(inputs=base_model.input, outputs=predictions)

# freeze the base model layers
for layer in base_model.layers:
    layer.trainable = False

# compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# print model summary
model.summary()

# train the model (train_data is already batched, so batch_size is not passed to fit)
model.fit(train_data, epochs=EPOCHS)



  1/249 ━━━━━━━━━━━━━━━━━━━━ 1:16:46 19s/step - accuracy: 0.2500 - loss: 257.9072



249/249 ━━━━━━━━━━━━━━━━━━━━ 61s 173ms/step - accuracy: 0.2623 - loss: 104.7551


<keras.src.callbacks.history.History at 0x7f36681c70d0>
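
The very large loss values in the training log may be related to input scaling: `image_dataset_from_directory` yields raw pixel values in the range 0 to 255, while the pretrained ResNet V2 weights expect inputs scaled to the range -1 to 1. The sketch below reproduces that scaling in NumPy for illustration; `scale_to_minus_one_one` is a hypothetical helper, and whether this explains the loss in this run is an assumption, not something measured here.

```python
import numpy as np


def scale_to_minus_one_one(images):
    """Map pixel values from [0, 255] to [-1, 1]."""
    images = np.asarray(images, dtype=np.float32)
    return images / 127.5 - 1.0


# Three example pixel values: black, mid-grey, white.
batch = np.array([[0.0, 127.5, 255.0]])
print(scale_to_minus_one_one(batch))  # [[-1.  0.  1.]]
```

In Keras the same scaling is provided by `tensorflow.keras.applications.resnet_v2.preprocess_input`, which could be applied to the datasets with something like `train_data.map(lambda x, y: (preprocess_input(x), y))`.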

## Evaluation

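The model is judged by the accuracy of its predictions on the held-out 20% split.

Across roughly 10 training runs, the accuracy was around 28%. The run shown below reached about 25%, which is what uniform random guessing over 4 classes would give.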

In [4]:
predictions = model.predict(test_data)
predictions = np.argmax(predictions, axis=1)
actual = np.concatenate([y for x, y in test_data], axis=0)
print(predictions)
print(actual)

# calculate accuracy
accuracy = np.mean(predictions == actual)
print(f'Accuracy: {accuracy}')

# save accuracy to file
with open('baseline-accuracy.txt', 'w') as f:
    f.write(f'Accuracy: {accuracy}')


[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 254ms/step
[3 3 3 ... 0 3 2]
[3 2 3 ... 3 2 1]
Accuracy: 0.2491210447011552
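
One possible reason the accuracy lands at chance level: `image_dataset_from_directory` shuffles by default, so iterating `test_data` once for `model.predict` and a second time to collect the labels may yield the batches in different orders, which would pair predictions with the wrong labels. A minimal sketch of a single-pass evaluation is shown here; `aligned_accuracy` is a hypothetical helper, not part of Keras, and it assumes `predict_fn` returns one row of class probabilities per image.

```python
import numpy as np


def aligned_accuracy(batches, predict_fn):
    """Accuracy computed in one pass so each prediction stays paired with its label."""
    correct = 0
    total = 0
    for images, labels in batches:
        probabilities = np.asarray(predict_fn(images))
        predicted = np.argmax(probabilities, axis=1)
        labels = np.asarray(labels)
        correct += int(np.sum(predicted == labels))
        total += labels.shape[0]
    return correct / total
```

With the objects from this notebook the call would be `aligned_accuracy(test_data, model.predict_on_batch)`. The built-in `model.evaluate(test_data)` is an alternative that also computes loss and accuracy in a single pass.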


