# **Deep Learning Project**

Made by students:
  - **Emanuele Conforti (252122)**
  - **Jacopo Garofalo (252093)**
  - **Gianmarco La Marca (252256)**

## **Environment initialization**

The **aim** of the project is to **generate a radiology report from chest X-ray images**.

## **ChestXRays Notebook Description**

This is the notebook with the best solution we found, using the following trained models (imported from the other notebooks):

- **encoderCNN**;
- **mapper with embedding approach (cosine similarity as loss function)**;
- **GPT2 transformer**.

Here we also compute the final **metrics** (**ROUGE** and **BLEU**).

### **Running the code on Colab**

- Run the following cells with the variable **onColab = True** if you are on Colab.
- We recommend running all the code on Kaggle!

In [None]:
onColab = False

if onColab:
    ! pip install kaggle
    ! mkdir -p ~/.kaggle
    ! cp kaggle.json ~/.kaggle/
    ! chmod 600 ~/.kaggle/kaggle.json

In [None]:
if onColab:
    ! kaggle datasets download raddar/chest-xrays-indiana-university

In [None]:
import zipfile
import os

if onColab:
    file_name = "chest-xrays-indiana-university.zip"
    
    # extract the file from the zip
    with zipfile.ZipFile(file_name, 'r') as zip_ref:
        zip_ref.extractall("chest_xrays_data")

In [None]:
if onColab:
    !ls chest_xrays_data

In [None]:
if onColab: 
    img_dir = 'chest_xrays_data/images/images_normalized/'
    reports_dir = 'chest_xrays_data/indiana_reports.csv'
    projections_dir = 'chest_xrays_data/indiana_projections.csv'
else:
    img_dir = '/kaggle/input/chest-xrays-indiana-university/images/images_normalized/'
    reports_dir = '/kaggle/input/chest-xrays-indiana-university/indiana_reports.csv'
    projections_dir = '/kaggle/input/chest-xrays-indiana-university/indiana_projections.csv'

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
from PIL import Image
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, Subset

from transformers import GPT2Tokenizer, GPT2LMHeadModel, BioGptTokenizer, BioGptForCausalLM
import torch.optim as optim
from torch.optim import AdamW

from tqdm import tqdm
from tqdm.auto import trange

import torchvision
from torchvision import transforms as T

In [None]:
torch.backends.cudnn.benchmark = True

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"Using device: {torch.cuda.get_device_name(0)}" if torch.cuda.is_available() else "Using CPU")

## **Pre-processing**

Here we print the datasets, analyze the data inside them, and prepare them for the training phase.

#### **We first visualize the first rows of the datasets** 

In [None]:
reports_df = pd.read_csv(reports_dir)
reports_df.head()

In [None]:
projections_df = pd.read_csv(projections_dir)
projections_df.head()

In [None]:
reports_df.shape, projections_df.shape

In [None]:
def visualize_sample_data():
  for uid in range(1, 4):
    plt.figure(figsize=(10, 5))
    print("\nUID: ", uid)

    findings = list(reports_df[reports_df['uid'] == uid]['findings'])[0]
    images = projections_df[projections_df['uid'] == uid]['filename']

    for img in images:
      png_img = Image.open(os.path.join(img_dir, img))
      png_img = png_img.convert('RGB')
      plt.title(img)
      plt.imshow(png_img)
      plt.axis('off')
      plt.show()
    print("Findings:", findings)

visualize_sample_data()

#### **We check the number of null values for each column**

In [None]:
reports_df.info()

In [None]:
reports_df.isna().sum()

In [None]:
projections_df.info()

#### **We analyze the distribution of images across the dataset rows (number of images per UID)**

In [None]:
image_counts = projections_df.groupby("uid")["filename"].count()

num_uids = []
num_uids.append(reports_df["uid"].nunique() - image_counts.count())

for i in range(1, 7):
    num_uids.append((image_counts == i).sum())

print(f"Sum of all counted entries: {sum(num_uids)}")
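# the sum above also counts UIDs with no images, so compare it with the number of report UIDs
print(f"Total entries: {reports_df['uid'].nunique()}")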

labels = [f"{i} images" for i in range(0, 7)]

plt.figure(figsize=(8, 5))
plt.bar(labels, num_uids, color='skyblue')

plt.xlabel("Number of associated images")
plt.ylabel("Number of UIDs")
plt.title("Distribution of images number per UID")

for i, v in enumerate(num_uids):
    plt.text(i, v + 2, str(v), ha='center', fontsize=12)

plt.show()

In [None]:
# visualize the 5 images related to the same entry
def visualize_data(uid):
    print(f"UID with 5 images associated: {uid}")
    plt.figure(figsize=(10, 5))
    
    images = projections_df[projections_df['uid'] == uid]['filename']
    
    for img in images:
      png_img = Image.open(os.path.join(img_dir, img))
      png_img = png_img.convert('RGB')
      plt.title(img)
      plt.imshow(png_img)
      plt.axis('off')
      plt.show()
        
uid = list(image_counts[image_counts == 5].index)[0]
visualize_data(uid)

#### **Since we want to use the findings column as labels (it contains the report we want to generate), we delete all the rows with null findings**

In [None]:
# filter the rows with null findings
reports_filtered = reports_df.dropna(subset=["findings"])

# keep only entries in projections that have a filtered report associated (association through uid)
projections_filtered = projections_df[projections_df["uid"].isin(reports_filtered["uid"])]
reports_filtered.shape, projections_filtered.shape

In [None]:
reports_filtered.isna().sum()

#### **Split the filtered dataset (containing only the UID column) into training and validation sets**

In [None]:
VAL_SIZE = 0.1

uids = reports_filtered.uid.unique()

train_ds, val_ds = train_test_split(
    uids,
    test_size=VAL_SIZE,
    random_state=42
)

len(train_ds), len(val_ds)
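Splitting on UIDs rather than on image rows keeps all images of one study on the same side of the split. The sketch below illustrates the idea in plain Python; the helper `split_uids` and the toy UIDs are hypothetical, and it is not the scikit-learn implementation used in the cell above.

```python
import random

def split_uids(uids, val_size=0.1, seed=42):
    """Shuffle the unique UIDs and split them into train and validation lists."""
    unique = sorted(set(uids))       # de-duplicate and fix the starting order
    rng = random.Random(seed)        # local RNG, so the split is reproducible
    rng.shuffle(unique)
    n_val = max(1, round(len(unique) * val_size))
    return unique[n_val:], unique[:n_val]

# hypothetical toy data: 20 studies
toy_uids = list(range(1, 21))
train_uids, val_uids = split_uids(toy_uids, val_size=0.1)
print(len(train_uids), len(val_uids))
```

Because the two lists share no UID, no report can appear in both training and validation.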

#### **We create a custom dataset containing only the data we need:**
- **images**
- **tokenized findings**
- **attention mask**

In [None]:
# adjusted dataset
class ChestXRayDataset(Dataset):
    def __init__(self, reports_df, projections_df, image_folder, tokenizr, uids, transforms):
        self.reports_df = reports_df[reports_df["uid"].isin(uids)].reset_index(drop=True)
        self.projections_df = projections_df
        self.image_folder = image_folder
        self.tokenizer = tokenizr
        # a series of transformations to be applied to images before feeding them into a model
        self.transform = transforms

    def __len__(self):
        return len(self.reports_df)

    def __getitem__(self, idx):
        row = self.reports_df.iloc[idx]
        uid = row["uid"]
        text = row["findings"]

        # tokenize findings column
        encoded_text = self.tokenizer(
            text, padding="max_length", truncation=True, max_length=144, return_tensors="pt"
        )

        # find the path and filename of the first image associated with this uid
        image_filename = self.projections_df[self.projections_df["uid"] == uid]["filename"].values[0]
        image_path = os.path.join(self.image_folder, image_filename)

        # load and transform the image
        image = Image.open(image_path).convert("L")  # conversion to grayscale
        image = self.transform(image)

        # return the image, the tokenized findings (label) and the attention mask
        return image, encoded_text["input_ids"].squeeze(0), encoded_text["attention_mask"].squeeze(0)

# initialize the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

tf = T.Compose([
    T.Resize((224, 224)),  # resizing for pre-trained models
    T.ToTensor(),
])

train_dataset = ChestXRayDataset(reports_filtered, projections_filtered, img_dir, tokenizer, train_ds, tf)
val_dataset = ChestXRayDataset(reports_filtered, projections_filtered, img_dir, tokenizer, val_ds, tf)
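The tokenizer call in the dataset pads or truncates every report to 144 tokens and returns a matching attention mask. As a rough illustration of what `padding="max_length"` and `truncation=True` produce, here is a plain-Python sketch; `pad_and_mask` is a hypothetical helper and the token ids are made up, with 50256 being GPT-2's EOS id, which the notebook reuses as the padding token.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad a token list and build the matching attention mask."""
    ids = token_ids[:max_length]              # truncation
    n_real = len(ids)
    n_pad = max_length - n_real
    padded = ids + [pad_id] * n_pad           # right padding
    mask = [1] * n_real + [0] * n_pad         # 1 = real token, 0 = padding
    return padded, mask

# hypothetical token ids, padded to a short length for readability
ids, mask = pad_and_mask([101, 102, 103], max_length=6, pad_id=50256)
print(ids)
print(mask)
```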

#### **Visualize the data of the new dataset**

In [None]:
# the image should be a pytorch tensor 
image, label, att_mask = train_dataset[100]
image

In [None]:
image.shape

In [None]:
label

In [None]:
att_mask

#### **Create the dataloaders: we split the datasets created above into batches, for both the training set and the validation set**

In [None]:
BATCH_SIZE = 32

# create the DataLoader to generate batches of the dataset and iterate over them
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

#### **In case of an OutOfMemoryError, the following code frees the GPU cache**

In [None]:
import gc

gc.collect()
torch.cuda.empty_cache()

#### **Now we can start building our model. It will mainly be a CustomAutoencoder composed of the following elements:**
- **encoder**: a **convolutional encoder** that takes the images and encodes them in a **latent space**;
- **decoder**: a **transformer** that takes the latent space generated by the encoder and the findings column, and generates the text (report).

In [None]:
def conv_layer(n_input, n_output, kernel_size, stride=1):
    return nn.Sequential(
        nn.Conv2d(n_input, n_output, kernel_size, stride),
        nn.ReLU(),
        nn.BatchNorm2d(n_output),
        nn.MaxPool2d(2)
    )

In [None]:
encoder = nn.Sequential(
            conv_layer(1, 64, 3),
            conv_layer(64, 128, 3),
            conv_layer(128, 256, 3),
            conv_layer(256, 512, 3)
        )

# Here we use encoderCNN as the encoder. We also implemented a VAE and tried it as the encoder,
# but it was not as effective as encoderCNN. Both models are described in the
# EncoderChestX notebook.
# map_location lets the checkpoint load on CPU-only machines as well
encoder.load_state_dict(torch.load("/kaggle/input/encodercnn/pytorch/default/1/encoder.pth", map_location=device))
encoder.to(device)
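Each `conv_layer` applies an unpadded 3x3 convolution followed by a 2x2 max pooling, so the spatial size of the feature map can be worked out by hand. The sketch below does that arithmetic for a 224x224 input; `conv_out` and `pool_out` are hypothetical helper names that apply the standard output-size formulas.

```python
def conv_out(size, kernel_size=3, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

def pool_out(size, pool=2):
    """Output size of a max pooling whose stride equals its window."""
    return size // pool

size = 224
for block in range(1, 5):
    size = pool_out(conv_out(size))
    print(f"after block {block}: {size}x{size}")

positions = size * size
print("spatial positions:", positions)
```

The final 12x12 grid gives 144 positions, which matches the `max_length=144` used when tokenizing the findings.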

In [None]:
def linear_layer(dim_input, dim_output, drop_p=0.1, last=False):
    layers = [nn.Linear(dim_input, dim_output)]
    if not last:
        layers.append(nn.ReLU())
        layers.append(nn.Dropout(p=drop_p))
    return nn.Sequential(*layers)

In [None]:
class FF_mapper(nn.Module):

    def __init__(self, dim_input, dim_output):
        super().__init__()
        self.ff = nn.Sequential(
            linear_layer(dim_input, 640),
            linear_layer(640, 896),
            #linear_layer(896, 1024),
            linear_layer(896, dim_output, last=True),
            nn.LayerNorm(dim_output)
        )

    def forward(self, latent_space):
        # move channels last, then flatten the spatial grid into a sequence of H * W positions
        batch_size, C, H, W = latent_space.shape
        latent_space = latent_space.permute(0, 2, 3, 1)  # (B, 12, 12, 512) for 224x224 inputs
        latent_space = latent_space.view(batch_size, H * W, C)  # (B, 144, 512)
        return self.ff(latent_space)

mapper = FF_mapper(512, 768).to(device)

# Here we use the ff_mapper with the embedding approach (cosine similarity as loss function)
# as the mapper. We also implemented an ff_mapper with a token approach (which uses cross entropy
# as loss function), but it was not as effective as this one. Both models are described in the
# ff-mapper notebook.
# map_location lets the checkpoint load on CPU-only machines as well
mapper.load_state_dict(torch.load("/kaggle/input/mapper/pytorch/cos_sim/2/ff_mapper_GPT2.pth", map_location=device))
mapper.to(device)
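The mapper loaded here was trained in another notebook with cosine similarity as the loss. The exact training code is not shown in this notebook, so the following is only a sketch of what such a loss could look like: it averages `1 - cosine similarity` over sequence positions. The names `cosine_similarity` and `cosine_loss` and the toy vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)

def cosine_loss(pred_embeds, target_embeds):
    """Mean of (1 - cosine similarity) over the sequence positions."""
    sims = [cosine_similarity(p, t) for p, t in zip(pred_embeds, target_embeds)]
    return 1.0 - sum(sims) / len(sims)

# toy 2-dimensional embeddings for two positions
pred = [[1.0, 0.0], [0.0, 2.0]]
target = [[2.0, 0.0], [1.0, 0.0]]
loss = cosine_loss(pred, target)
print(loss)
```

The first pair points in the same direction and the second pair is orthogonal, so the loss depends only on direction and not on vector length.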

#### **Now we import the pre-trained transformer (GPT2) and start working on it**

In [None]:
# Here we use GPT2 as the transformer. We also tried BioGPT,
# but it was not as effective as GPT2. Both models are described in the
# TransformerChestX notebook.
transformer = GPT2LMHeadModel.from_pretrained("gpt2")

for param in transformer.parameters():
    param.requires_grad = False  # Freezes all transformer parameters

transformer.to(device)

In [None]:
# function used to generate a report:
# the transformer takes as input some embeddings (inputs_embeds) and corresponding attention masks
def generate_text(transformer, inputs_embeds, attention_mask):
    return transformer.generate(
        inputs_embeds=inputs_embeds, 
        max_length=288,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=2,   # avoid repetitions
        #top_k=50,   # consider only the 50 most probable words
        eos_token_id=None,
        do_sample=False
    )
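The `no_repeat_ngram_size=2` argument prevents the generated text from containing the same bigram twice. The sketch below illustrates the blocking rule in plain Python: before each step, every token that would complete an already seen n-gram is banned. `banned_next_tokens` is a hypothetical helper written for illustration, not the library's internal code.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # the last (n - 1) tokens are the start of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# "no acute" already occurred, so "acute" may not follow the second "no"
tokens = ["no", "acute", "findings", "no"]
banned = banned_next_tokens(tokens, ngram_size=2)
print(banned)
```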

#### **Here is the structure of the main CustomAutoencoder**

In [None]:
# General Autoencoder
class CustomAutoencoder(torch.nn.Module):
    def __init__(self, encoder, mapper, transformer):
        super().__init__()

        # The encoder takes images as input and encodes them in a latent space
        self.encoder = encoder

        # Adapt the latent space dimensions
        self.mapper = mapper
    
        # The decoder should take the latent space (images) and generate the report
        self.decoder = transformer
    
    def forward(self, images, attention):

        # the latent space computed by this model's own encoder (not the global variable)
        latent_space = self.encoder(images).to(device)

        # the embeddings (computed by the mapper), which the transformer will take as input
        pred_embeds = self.mapper(latent_space)

        # return the (tokenized) text generated by the transformer
        return generate_text(self.decoder, pred_embeds, attention)

In [None]:
final_model = CustomAutoencoder(encoder, mapper, transformer)

In [None]:
def generate_text_from_dataset(loader, modell, tokenizr):
    data_iter = iter(loader)
    image, text, attention = next(data_iter)
    
    print(f"Real Text:\n{tokenizr.decode(text[0], skip_special_tokens=True)}\n\n")
    
    image = image.to(device)
    text = text.to(device)
    attention = attention.to(device)
    
    with torch.no_grad():
        predicted_text = modell(image, attention).to(device)
    
    print(f"Predicted Text:\n{tokenizr.decode(predicted_text[0], skip_special_tokens=True)}")

generate_text_from_dataset(train_loader, final_model, tokenizer)

### **Building a test set**

- We built a test set taking data from a new dataset (https://huggingface.co/datasets/Sina-Alinejad-2002/train_chexpert)

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset("Sina-Alinejad-2002/train_chexpert", split="train[:2000]")

# convert into DataFrame
df = dataset.to_pandas()

df.head()

In [None]:
df.shape

In [None]:
from IPython.display import display

display(dataset[0]['image'])

In [None]:
from PIL import Image
from io import BytesIO

class HuggingFaceChestXRayDataset(Dataset):
    def __init__(self, df, tokenizer, transform=None):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        
        # Load image from bytes
        image_bytes = row["image"]["bytes"]
        image = Image.open(BytesIO(image_bytes)).convert("L")

        if self.transform:
            image = self.transform(image)

        # Tokenize report
        encoded_text = self.tokenizer(
            row["report"],
            padding="max_length",
            truncation=True,
            max_length=144,
            return_tensors="pt"
        )

        input_ids = encoded_text["input_ids"].squeeze(0)
        attention_mask = encoded_text["attention_mask"].squeeze(0)

        return image, input_ids, attention_mask

# tf (transform) and tokenizer defined above
test_set = HuggingFaceChestXRayDataset(df, tokenizer, tf)

# Test a sample
image, input_ids, attention_mask = test_set[0]

print("Image shape:", image.shape)
print("Input IDs:", input_ids.shape)
print("Attention Mask:", attention_mask.shape)

In [None]:
# take the first row from the new dataset
image, input_ids, attention_mask = test_set[0]

# display the new image
plt.imshow(image.squeeze(0), cmap="gray")
plt.title("Chest X-Ray")
plt.axis("off")
plt.show()

In [None]:
# generate a text from the test_set
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=True)
generate_text_from_dataset(test_loader, final_model, tokenizer)

## **Metrics application**

- Here we apply some metrics (**ROUGE** and **BLEU**) to our model to analyze its behavior
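Both metrics are based on n-gram overlap between the generated report and the reference. As a simplified illustration, the sketch below computes a ROUGE-1 F-measure and a clipped unigram precision (the building block of BLEU) in plain Python. It has no stemming, no brevity penalty and no higher-order n-grams, so its values will differ from those of `rouge_score` and `nltk`. The function names and the example sentences are hypothetical.

```python
from collections import Counter

def unigram_overlap(reference, prediction):
    """Clipped unigram matches plus the lengths of both texts."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    overlap = sum((ref_counts & pred_counts).values())  # per-word minimum count
    return overlap, sum(ref_counts.values()), sum(pred_counts.values())

def rouge1_f(reference, prediction):
    """Harmonic mean of unigram precision and recall."""
    overlap, n_ref, n_pred = unigram_overlap(reference, prediction)
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)

def unigram_precision(reference, prediction):
    """Clipped unigram precision, as used inside BLEU."""
    overlap, _, n_pred = unigram_overlap(reference, prediction)
    return overlap / n_pred if n_pred else 0.0

reference = "the heart size is normal"
prediction = "the heart is enlarged"
print(rouge1_f(reference, prediction))
print(unigram_precision(reference, prediction))
```

In this example three of the four predicted words appear in the reference, so precision is higher than recall.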

In [None]:
!pip install rouge-score

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import random

In [None]:
def calculate_rouge(prediction, reference):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, prediction)
    return scores['rouge1'].fmeasure, scores['rouge2'].fmeasure, scores['rougeL'].fmeasure


def calculate_metrics(val_set, modell, sample_ratio=0.3):
    sample_size = int(len(val_set) * sample_ratio)
    
    sample_indices = random.sample(range(len(val_set)), sample_size)

    # decode with the tokenizer the dataset was built with (GPT-2 or BioGPT);
    # fall back to the global GPT-2 tokenizer if the dataset does not expose one
    tokenizr = getattr(val_set, "tokenizer", tokenizer)
    
    # Subset
    val_sample = Subset(val_set, sample_indices)
    
    # We create a new dataloader
    val_sample_loader = DataLoader(val_sample, batch_size=1, shuffle=False)
    
    bleu_scores = []
    
    rouge1_scores = []
    rouge2_scores = []
    rougeL_scores = []
    
    for images, input_ids, attention_mask in val_sample_loader:
        images = images.to(device)
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
    
        with torch.no_grad():
            predicted_text = modell(images, attention_mask).to(device)
    
        pred_text = tokenizr.decode(predicted_text[0], skip_special_tokens=True)
        true_text = tokenizr.decode(input_ids[0], skip_special_tokens=True)
    
        # We calculate BLEU
        bleu_result = sentence_bleu([true_text.split()], pred_text.split())
    
        bleu_scores.append(bleu_result)
    
        # We calculate ROUGE (arguments: prediction first, then reference)
        rouge1, rouge2, rougeL = calculate_rouge(pred_text, true_text)
        rouge1_scores.append(rouge1)
        rouge2_scores.append(rouge2)
        rougeL_scores.append(rougeL)

    return bleu_scores, rouge1_scores, rouge2_scores, rougeL_scores

bleu_scores, rouge1_scores, rouge2_scores, rougeL_scores = calculate_metrics(test_set, final_model)
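Since `random.sample` is called without a seed, every call to `calculate_metrics` evaluates a different subset. When two models are compared, it may help to draw the indices once from a seeded generator and reuse them. A minimal sketch, assuming a hypothetical helper `sample_indices` that is not part of the notebook:

```python
import random

def sample_indices(dataset_len, sample_ratio=0.3, seed=0):
    """Draw a reproducible random subset of dataset indices."""
    rng = random.Random(seed)                      # local RNG, independent of global state
    sample_size = int(dataset_len * sample_ratio)
    return sorted(rng.sample(range(dataset_len), sample_size))

first = sample_indices(2000)
second = sample_indices(2000)
print(len(first), first == second)
```

The same index list could then be passed to `Subset` for each model, so that the scores refer to the same reports.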

#### **Plot of BLEU metric**

In [None]:
def plot_bleu(scores, save_path):
    mean_bleu = np.mean(scores)
    print(f"Mean BLEU on sampling: {mean_bleu:.4f}")
    
    # Histogram
    plt.figure(figsize=(12, 8))
    plt.hist(scores, bins=20, range=(0, 1), color='skyblue', edgecolor='black')
    plt.title('BLEU scores distribution')
    plt.xlabel('BLEU score')
    plt.ylabel('Frequency')
    plt.xticks(np.linspace(0, 1, 21))  # Tick every 0.05, 21 values
    plt.grid(True)
    plt.savefig(f"{save_path}_bleu.png")
    plt.show()

save_path = "metrics_analysis"
plot_bleu(bleu_scores, save_path=save_path)

#### **Plot of ROUGE metric**

In [None]:
def plot_rouge(rouge1, rouge2, rougeL, save_path):
    mean_rouge1 = np.mean(rouge1)
    mean_rouge2 = np.mean(rouge2)
    mean_rougeL = np.mean(rougeL)
    
    print(f"Mean ROUGE-1: {mean_rouge1:.4f}")
    print(f"Mean ROUGE-2: {mean_rouge2:.4f}")
    print(f"Mean ROUGE-L: {mean_rougeL:.4f}")
    
    # Histogram for ROUGE-1
    plt.figure(figsize=(12, 8))
    plt.hist(rouge1, bins=20, range=(0, 1), color='lightcoral', edgecolor='black')
    plt.title('ROUGE-1 scores distribution')
    plt.xlabel('ROUGE-1 score (0 → 1)')
    plt.ylabel('Frequency')
    plt.xticks(np.linspace(0, 1, 21))  # Tick every 0.05, 21 values
    plt.grid(True)
    plt.savefig(f"{save_path}_rouge1.png")
    plt.show()
    
    # Histogram for ROUGE-2
    plt.figure(figsize=(12, 8))
    plt.hist(rouge2, bins=20, range=(0, 1), color='lightgreen', edgecolor='black')
    plt.title('ROUGE-2 scores distribution')
    plt.xlabel('ROUGE-2 score (0 → 1)')
    plt.ylabel('Frequency')
    plt.xticks(np.linspace(0, 1, 21))  # Tick every 0.05, 21 values
    plt.grid(True)
    plt.savefig(f"{save_path}_rouge2.png")
    plt.show()
    
    # Histogram for ROUGE-L
    plt.figure(figsize=(12, 8))
    plt.hist(rougeL, bins=20, range=(0, 1), color='lightblue', edgecolor='black')
    plt.title('ROUGE-L scores distribution')
    plt.xlabel('ROUGE-L score (0 → 1)')
    plt.ylabel('Frequency')
    plt.xticks(np.linspace(0, 1, 21))  # Tick every 0.05, 21 values
    plt.grid(True)
    plt.savefig(f"{save_path}_rougeL.png")
    plt.show()

plot_rouge(rouge1_scores, rouge2_scores, rougeL_scores, save_path=save_path)

#### **We build the model using BioGPT as transformer**
- This lets us compute the metrics for both GPT2 and BioGPT.

In [None]:
# for BioGPT tokenizer
!pip install sacremoses

In [None]:
# initialize the tokenizer
biogpt_tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt") 
biogpt_tokenizer.pad_token = biogpt_tokenizer.eos_token

# initialize the model
biogpt = BioGptForCausalLM.from_pretrained("microsoft/biogpt").to(device)
biogpt_hidden_size = biogpt.config.hidden_size

# initialize the dataset
biogpt_test_set = HuggingFaceChestXRayDataset(df, biogpt_tokenizer, tf)

# initialize the dataLoader
biogpt_test_loader = DataLoader(biogpt_test_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True)

# import the biogpt ff_mapper
biogpt_mapper = FF_mapper(512, biogpt_hidden_size).to(device)
# map_location lets the checkpoint load on CPU-only machines as well
biogpt_mapper.load_state_dict(torch.load("/kaggle/input/ff_mapper_biogpt/pytorch/default/1/ff_mapper_BioGPT.pth", map_location=device))
biogpt_mapper.to(device)

# initialize the custom autoencoder with biogpt_mapper and BioGPT 
biogpt_autoencoder_model = CustomAutoencoder(encoder, biogpt_mapper, biogpt)

# generate a text from a dataset row
generate_text_from_dataset(biogpt_test_loader, biogpt_autoencoder_model, biogpt_tokenizer)
biogpt_save_path = "metrics_analysis_biogpt"

#### **Checking metrics on the BioGPT model**

In [None]:
biogpt_bleu_scores, biogpt_rouge1_scores, biogpt_rouge2_scores, biogpt_rougeL_scores = calculate_metrics(biogpt_test_set, biogpt_autoencoder_model)

In [None]:
plot_bleu(biogpt_bleu_scores, save_path=biogpt_save_path)

In [None]:
plot_rouge(biogpt_rouge1_scores, biogpt_rouge2_scores, biogpt_rougeL_scores, save_path=biogpt_save_path)