# Contrastive Language Image Pre-Training (CLIP) on Radiology Objects in COntext (ROCO)

## Import libraries and data
Before running the notebook, complete the following steps:
- Go to "Runtime" > "Change runtime type" and select a GPU-based runtime;
- Upload the `resized_train.zip`, `caption_prediction_train.csv`, and `concept_detection_train.csv` files to the Files panel.

In [25]:
import numpy as np
import pandas as pd
import tensorflow as tf
from PIL import Image
import matplotlib.pyplot as plt

tfk = tf.keras
tfkl = tfk.layers

Extract the images from `resized_train.zip`:

In [None]:
!unzip resized_train.zip

Load textual data:

In [26]:
# Load the captions
captions = pd.read_csv("caption_prediction_train.csv",sep="\t", index_col="ID")

# Load the labels
labels = pd.read_csv("concept_detection_train.csv",sep="\t",index_col="ID")
# Each label string contains multiple labels separated by semicolons.
# Split each string into a list of labels.
labels["cuis"] = labels["cuis"].str.split(pat=";")
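
As a small illustration of the splitting step, the sketch below applies the same `str.split` call to a toy DataFrame. The IDs and concept codes are made up for the example and do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for concept_detection_train.csv (hypothetical IDs and codes)
toy_labels = pd.DataFrame(
    {"cuis": ["C0000001;C0000002", "C0000003"]},
    index=pd.Index(["img_a", "img_b"], name="ID"),
)

# Split each semicolon-separated string into a list of labels
toy_labels["cuis"] = toy_labels["cuis"].str.split(pat=";")

print(toy_labels.loc["img_a", "cuis"])
# Number of labels attached to each image
print(toy_labels["cuis"].map(len).tolist())
```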

We check that all images are 128x128 pixels and find the maximum number of channels they have:

In [None]:
max_channels = 1
for i in labels.index:
  image_name = "resized_train/"+i+".jpg"
  img = Image.open(image_name)
  img = np.array(img)
  if img.ndim==3 and img.shape[2] > max_channels:
    max_channels = img.shape[2]
  # Check if the image has the expected resolution of 128x128
  if img.shape[0]!=128 or img.shape[1]!=128:
    print(f"Error in {image_name}: its resolution is " +
      f"{img.shape[0]}x{img.shape[1]}, while it should be 128x128")

print(f"The maximum number of channels is {max_channels}.")

The maximum number of channels is 3.
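
The check above guards on `img.ndim` because a grayscale image is decoded as a two-dimensional array with no channel axis. The sketch below reproduces this with synthetic in-memory images (created only for illustration, not taken from the dataset) and shows that converting to RGB gives a uniform three-channel layout.

```python
import numpy as np
from PIL import Image

# Synthetic images created in memory, only for illustration
gray = Image.new("L", (128, 128))      # single-channel (grayscale)
color = Image.new("RGB", (128, 128))   # three-channel

gray_arr = np.array(gray)
color_arr = np.array(color)
print(gray_arr.shape)   # 2-D array: no channel axis
print(color_arr.shape)  # 3-D array: channel axis last

# Converting to RGB gives every image the same 3-channel layout
gray_as_rgb = np.array(gray.convert("RGB"))
print(gray_as_rgb.shape)
```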


We load all the images, shuffle them, and split them into training and test sets:

In [27]:
images_train, images_test = tfk.utils.image_dataset_from_directory(
    "./resized_train/", batch_size=32, labels=None, label_mode=None,
    image_size=(128,128), shuffle=True, color_mode = "rgb",
    validation_split = 0.2, subset="both", seed=32414141)

Found 83275 files belonging to 1 classes.
Using 66620 files for training.
Using 16655 files for validation.


Following the image split, we also split the labels and the captions into training and test sets.

In [28]:
# Get the ordered IDs of the images in each dataset. An ID is the file name
# without its directory and extension.
from pathlib import Path

train_idx = np.array([Path(p).stem for p in images_train.file_paths])
# The test IDs must come from the test dataset, not the training one.
test_idx = np.array([Path(p).stem for p in images_test.file_paths])

captions_train = captions.loc[train_idx]
captions_test = captions.loc[test_idx]
labels_train = labels.loc[train_idx]
labels_test = labels.loc[test_idx]
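
When deriving IDs from file paths, it helps to remember that `np.char.strip` treats its second argument as a set of characters to remove from both ends, not as a prefix or suffix. The sketch below compares it with `pathlib.Path.stem` on two made-up paths; whether the difference matters in practice depends on how the actual file names begin and end.

```python
from pathlib import Path
import numpy as np

# Hypothetical file paths, chosen to expose the difference
paths = np.array(["./resized_train/test_01.jpg",
                  "./resized_train/ROCO_00042.jpg"])

# np.char.strip removes any of the listed characters from both ends
stripped = np.char.strip(paths, "./resized_train/.jpg")

# Path.stem drops the directory and the extension, nothing else
stems = np.array([Path(p).stem for p in paths])

print(stripped.tolist())
print(stems.tolist())
```

The first ID loses its `test_` prefix under character-set stripping, because every character of that prefix also appears in the strip argument, while the second ID survives unchanged.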

Set the random seed for reproducibility:

In [29]:
seed = 24948989491

rng = np.random.default_rng(seed)
tf.random.set_seed(seed)
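
A quick way to see what seeding buys us is to build two NumPy generators from the same seed and compare their output. This is only a sketch of the NumPy side; it does not exercise the TensorFlow seed.

```python
import numpy as np

seed = 24948989491

# Two generators built from the same seed produce the same stream
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

draws_a = rng_a.integers(0, 100, size=5)
draws_b = rng_b.integers(0, 100, size=5)

print(np.array_equal(draws_a, draws_b))
```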

## Build the CLIP model

### Define the CNN

We define the convolutional block, which will be the base block for the CNN.

In [30]:
def convolutional_block(x, filters, kernel):
  '''
  This function represents a convolutional block in a neural network.

  Parameters:
  - x: Input tensor, size (batch,h,w,channels)
  - filters: Number of filters for the convolution operation
  - kernel: Size of the convolution kernel

  Returns:
  - Output tensor after applying convolution and max pooling operations, size
    (batch,h/2,w/2,filters)

  Description:
  The convolutional_block function takes an input tensor x and applies a
  convolution operation with the specified number of filters and kernel size.
  The convolution is performed with a stride of 1, using 'same' padding (i.e.
  zero-padding the borders so that the spatial size is preserved), and the ReLU
  activation function. After the convolution, a max pooling operation with the
  default 2x2 window is applied, which halves the height and the width. The
  resulting tensor is returned as the output of the convolutional block.
  '''
  # Apply convolution operation to the input x
  x = tfkl.Conv2D(
      filters,
      kernel,
      strides=1,
      padding='same',
      activation='relu'
  )(x)
  # Apply max pooling operation to the convolution output
  x = tfkl.MaxPooling2D()(x)

  return x
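
The halving of height and width in the block's output comes from the pooling step: `tfkl.MaxPooling2D()` with default arguments uses a 2x2 window with stride 2. The NumPy sketch below mimics that operation on a tiny array. It is not the Keras implementation, the helper name `max_pool_2x2` is made up for the example, and it assumes even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """NumPy sketch of 2x2 max pooling with stride 2 on a
    (batch,h,w,channels) array; assumes h and w are even."""
    b, h, w, c = x.shape
    # Group pixels into non-overlapping 2x2 windows and take the max of each
    return x.reshape(b, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

# A single 4x4 image with one channel, filled with 0..15
x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
pooled = max_pool_2x2(x)

print(pooled.shape)
print(pooled[0, :, :, 0])
```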

We define the convolutional neural network, which is composed of a series of convolutional blocks followed by a flatten layer.

In [31]:
def get_convolutional_neural_network(blocks, filters, kernel):
  '''
  This function creates a convolutional neural network (CNN) model with the
  specified number of blocks, filters, and kernel size.

  Parameters:
  - blocks: Number of convolutional blocks in the network
  - filters: Base number of filters; the b-th block (counting from 1) uses
    filters*b filters
  - kernel: Size of the convolution kernel

  Returns:
  - CNN model

  Description:
  The get_convolutional_neural_network function constructs a CNN model using the
  Keras API. It takes the number of blocks, the base number of filters, and the
  kernel size as input parameters.

  The function begins by defining an input layer whose shape is read from the
  global variable input_shape, which must be defined before the function is
  called. The inputs are then normalized using the Normalization layer provided
  by TensorFlow Keras (tfkl).

  Next, the function iterates through the specified number of blocks and applies
  the convolutional_block function to extract features via convolution and
  pooling operations.

  After the convolutional blocks, the output tensor is flattened using the
  Flatten layer from tfkl.

  Finally, the Keras model is created using the tfk.Model constructor,
  specifying the inputs and outputs.

  If the model receives as input a (batch,h,w,channels) tensor, it returns a
  (batch,h/(2^blocks)*w/(2^blocks)*filters*blocks) tensor, since the last block
  has filters*blocks output channels.
  '''
  inputs = tfkl.Input(shape=input_shape, name='inputs')
  x = tfkl.Normalization()(inputs)
  # Extract features via convolution and pooling operations
  for b in range(blocks):
    x = convolutional_block(x, filters*(b+1), kernel)
  x = tfkl.Flatten()(x)
  # Create the Keras model
  model = tfk.Model(inputs=inputs, outputs=x)
  return model
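
The output size stated in the docstring can be checked with plain arithmetic. The helper below is a sketch with a made-up name, and the values `blocks=3` and `filters=16` are illustrative assumptions rather than settings taken from this notebook. It assumes each block halves the height and width (rounding down) and that the last block has `filters*blocks` channels.

```python
def cnn_output_size(h, w, blocks, filters):
    """Flattened feature size, assuming each block halves h and w (rounding
    down) and the b-th block (from 1) has filters*b output channels."""
    for _ in range(blocks):
        h, w = h // 2, w // 2
    # Only the channel count of the last block matters after flattening
    return h * w * filters * blocks

size = cnn_output_size(128, 128, blocks=3, filters=16)
print(size)
```

With these values the spatial size shrinks from 128x128 to 16x16 and the last block has 48 channels, so the flattened vector has 16 * 16 * 48 entries.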