# Fine-Tuning of the Feature Encoder

A very common way of applying deep learning techniques in digital pathology is Multiple Instance Learning (MIL).
The gigapixel image is cropped into a set of equally sized, non-overlapping tiles. The tiles are encoded with a pretrained feature extractor and then, in a second step, aggregated by a trainable model to solve a specific task.
Typically used feature encoders were pretrained either on ImageNet-21k or even on larger histological datasets. The architectures of such models range from smaller ones like ResNet18 up to Swin Transformer based architectures.
Since pretraining larger models from scratch requires huge amounts of resources, the focus of this work is fine-tuning a pretrained model.
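The tiling step described above can be sketched as follows. This is a minimal illustration, not the tiling code used for this dataset: the helper name `tile_origins`, the tile size of 256 pixels, and the choice to drop partial tiles at the image border are all assumptions.

```python
def tile_origins(width, height, tile_size):
    """Return the top-left (x, y) pixel coordinates of non-overlapping square tiles.

    Assumption: partial tiles at the right and bottom border are dropped.
    """
    return [
        (x, y)
        for y in range(0, height - tile_size + 1, tile_size)
        for x in range(0, width - tile_size + 1, tile_size)
    ]


# Toy image size; real whole-slide images are far larger.
origins = tile_origins(width=1000, height=600, tile_size=256)
print(len(origins))
print(origins[:4])
```

Each coordinate pair identifies one tile, which would then be passed through the feature encoder before aggregation.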

**The following experiments attempt to explore whether existing feature encoders can be fine-tuned for better survival analysis.**


The aim is to carry out the following experiments:

1. ResNet18: train from scratch/fine-tune on Survival Analysis
2. ViT Tiny: train from scratch/fine-tune on Survival Analysis (+Multimodality)
3. ViT Tiny MAE: fine-tune on Survival Analysis
4. ViT Tiny MAE: fine-tune on Survival Analysis in a supMAE fashion

To carry out these experiments, the following subtasks are needed:
1. Create a custom dataset (from zip) that is stratified on a patient level into a train/test split.
2. Create models, find checkpoints, and load parameters such that DDP is applicable.
3. Create a training function that allows partially freezing weights, fine-tuning, and training from a checkpoint.
4. Create an encoding pipeline.
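Subtask 3 (partially freezing weights) could look roughly like the sketch below. It assumes a PyTorch model whose parameters can be selected by name prefix. The helper `freeze_all_but` and the toy `backbone`/`head` model are hypothetical stand-ins, not the ResNet18 or ViT encoders listed above.

```python
from collections import OrderedDict

import torch
from torch import nn


def freeze_all_but(model, trainable_prefixes):
    """Disable gradients for every parameter whose name does not start with a given prefix."""
    prefixes = tuple(trainable_prefixes)
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)


# Toy stand-in for a pretrained encoder plus a task head (assumption: 4 output bins).
model = nn.Sequential(OrderedDict([
    ("backbone", nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(1),
    )),
    ("head", nn.Linear(8, 4)),
]))

# Freeze the backbone and keep only the head trainable.
freeze_all_but(model, ["head"])

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Passing an empty prefix string (`[""]`) would make every parameter trainable again, which corresponds to full fine-tuning.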


# DataLoader

The data consists of ~1000 patients. Each patient has exactly one genetic feature vector and can have multiple tile sets (one per slide, since a patient can have multiple slides).

Idea:

The dataloader receives a dataframe that contains metadata and genetic data (stratified by train/test split on a patient level)
and a path to the tiles. From this path, an os.walk is done to create a second dataframe that contains the file path and the slide_id.
This dataframe is then adapted to contain the tile_path, the metadata, and the index of the respective row within the genomic tensor.

1. Import dataframe, create genomic tensor and metadata dataframe
2. os.walk on the tile path to create the tile dataframe
3. Add a mapping from slide_id to index
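The patient-level stratification mentioned above matters because all slides of one patient must land in the same split. The sketch below illustrates the idea with the standard library only. It is not the implementation of `prepare_csv`: the function `split_by_patient`, the row layout, and the toy data are assumptions, and it assumes each patient carries a single label bin.

```python
import random
from collections import defaultdict


def split_by_patient(rows, frac_train=0.7, seed=0):
    """Assign each patient (case_id) to "train" or "test", stratified by label bin.

    rows: list of dicts with the keys "case_id", "slide_id", and "label".
    Returns a dict mapping case_id -> "train" or "test".
    """
    rng = random.Random(seed)
    # Group patients (not slides) by their label bin.
    by_label = defaultdict(set)
    for row in rows:
        by_label[row["label"]].add(row["case_id"])

    assignment = {}
    for label in sorted(by_label):
        patients = sorted(by_label[label])
        rng.shuffle(patients)
        n_train = int(frac_train * len(patients))
        for i, case_id in enumerate(patients):
            assignment[case_id] = "train" if i < n_train else "test"
    return assignment


# Toy data: 10 patients with 2 slides each and 2 label bins.
rows = [
    {"case_id": f"P{p}", "slide_id": f"P{p}-S{s}", "label": p % 2}
    for p in range(10)
    for s in range(2)
]
assignment = split_by_patient(rows)
train_slides = [r["slide_id"] for r in rows if assignment[r["case_id"]] == "train"]
test_slides = [r["slide_id"] for r in rows if assignment[r["case_id"]] == "test"]
print(len(train_slides), len(test_slides))
```

Because the assignment is keyed by `case_id`, no patient can contribute slides to both splits.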


In [1]:
# 1.)
import torch
from utils.Aggregation_Utils import *

# TODO: pass the remaining split fraction (0.1) under its own keyword; a keyword argument cannot be repeated.
df_train, df_test = prepare_csv(df_path="/work4/seibel/PORPOISE/datasets_csv/tcga_brca_all_clean.csv.zip", split="traintest", n_bins=4, save=True, frac_train=0.7)
df_train = df_train[df_train["traintest"] == 1]  # if train

# Columns 11 onwards hold the genomic features
genomics_tensor = torch.tensor(df_train[df_train.columns[11:]].to_numpy(), dtype=torch.float32)
df_meta = df_train[["slide_id", "survival_months_discretized", "censorship", "survival_months"]]
# Map each slide_id to its row position in df_meta / genomics_tensor
diction = {name: idx for idx, name in enumerate(df_meta["slide_id"])}



In [16]:
#2.)
import os
import pandas as pd
tile_path = "/work4/seibel/data/TCGA-BRCA-TILES/"
ext = "jpg"
file_list = []
root_list = []
for root, dirs, files in os.walk(tile_path, topdown=False):
    for name in files:
        file_list.append(os.path.join(root, name))
        root_list.append(root.split("/")[-1]+".svs")

df_tiles = pd.DataFrame({"tilepath":file_list,"slide_id":root_list},)
df_tiles = df_tiles[df_tiles["tilepath"].str.endswith(ext)] # Avoid having other files than .<ext> files in Dataframe


print(df_tiles.tilepath.iloc[0])
print(df_tiles.slide_id.iloc[0])
df_tiles.head()

/work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A5-D951CA8BA35D/TCGA-A7-A13E-01Z-00-DX2_(23973,31270).jpg
TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A5-D951CA8BA35D.svs


| | tilepath | slide_id |
|---|---|---|
| 0 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... |
| 1 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... |
| 2 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... |
| 3 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... |
| 4 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... |


In [17]:
# 3.)
df_tiles.insert(2,"slideid_idx",df_tiles["slide_id"].map(diction))
df_tiles = df_tiles.dropna()
df_tiles.slideid_idx = df_tiles.slideid_idx.astype(int)
df_tiles

| | tilepath | slide_id | slideid_idx |
|---|---|---|---|
| 0 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... | 130 |
| 1 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... | 130 |
| 2 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... | 130 |
| 3 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... | 130 |
| 4 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-A7-A13... | TCGA-A7-A13E-01Z-00-DX2.1E1262AE-A32D-4814-94A... | 130 |
| ... | ... | ... | ... |
| 470043 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-BH-A0E... | TCGA-BH-A0E7-01Z-00-DX1.7FEE54C4-3795-403C-85B... | 508 |
| 470044 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-BH-A0E... | TCGA-BH-A0E7-01Z-00-DX1.7FEE54C4-3795-403C-85B... | 508 |
| 470045 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-BH-A0E... | TCGA-BH-A0E7-01Z-00-DX1.7FEE54C4-3795-403C-85B... | 508 |
| 470046 | /work4/seibel/data/TCGA-BRCA-TILES/TCGA-BH-A0E... | TCGA-BH-A0E7-01Z-00-DX1.7FEE54C4-3795-403C-85B... | 508 |


In [42]:
import os

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from utils.Aggregation_Utils import *


class TileDataset(Dataset):
    def __init__(self, df_path, tilepath, ext, trainmode):
        """Custom Dataset for Feature Extractor Finetuning for Survival Analysis

        Args:
            df_path (str): Path to the dataframe which contains metadata and genomic data
            tilepath (str): Path to the folder which contains subfolders with tiles (subfolder names must be the slide id)
            ext (str): File extension of the tiles (e.g. jpg or png)
            trainmode (bool): Whether to generate the train set or the test set
        """
        super().__init__()
        # Genomic tensor and meta dataframe, restricted to the requested split
        df = pd.read_csv(df_path)
        df = df[df["traintest"] == (1 if trainmode else 0)]

        # Columns 11 onwards hold the genomic features
        self.genomics_tensor = torch.tensor(df[df.columns[11:]].to_numpy(), dtype=torch.float32)
        self.df_meta = df[["slide_id", "survival_months_discretized", "censorship", "survival_months"]]

        # Tile dataframe: one row per tile, slide_id derived from the subfolder name
        file_list = []
        root_list = []
        for root, dirs, files in os.walk(tilepath, topdown=False):
            for name in files:
                file_list.append(os.path.join(root, name))
                root_list.append(root.split("/")[-1] + ".svs")

        df_tiles = pd.DataFrame({"tilepath": file_list, "slide_id": root_list})
        df_tiles = df_tiles[df_tiles["tilepath"].str.endswith(ext)].copy()

        # Add the mapping from slide_id to the row index in df_meta / genomics_tensor
        diction = {name: idx for idx, name in enumerate(self.df_meta["slide_id"])}
        df_tiles.insert(2, "slideid_idx", df_tiles["slide_id"].map(diction))
        df_tiles = df_tiles.dropna().copy()  # drop tiles whose slide is not part of this split
        df_tiles["slideid_idx"] = df_tiles["slideid_idx"].astype(int)
        self.df_tiles = df_tiles

        # TODO transforms
        self.transforms = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomVerticalFlip(p=0.5),
        ])

    def __len__(self):
        return len(self.df_tiles)

    def __getitem__(self, idx):
        tile_path, _, slide_idx = self.df_tiles.iloc[idx]
        tile = Image.open(tile_path).convert("RGB")  # Normalize expects 3 channels
        tile = self.transforms(tile)

        label = torch.tensor(self.df_meta.iloc[slide_idx, 1]).type(torch.int64)
        censorship = torch.tensor(self.df_meta.iloc[slide_idx, 2]).type(torch.int64)
        # Continuous survival time, kept as float to avoid truncating the months
        label_cont = torch.tensor(self.df_meta.iloc[slide_idx, 3]).type(torch.float32)
        return tile, self.genomics_tensor[slide_idx], censorship, label, label_cont


df_path_train = "/work4/seibel/PORPOISE/datasets_csv/tcga_brca__4bins_trainsplit.csv"
df_path_test = "/work4/seibel/PORPOISE/datasets_csv/tcga_brca__4bins_testsplit.csv"

tilepath = "/work4/seibel/data/TCGA-BRCA-TILES/"
ext = "jpg"
trainmode = True


DS = TileDataset(df_path_train, tilepath, ext, trainmode)
DS[8]

(tensor([[[0.8667, 0.8745, 0.8667,  ..., 0.8745, 0.9137, 0.8824],
          [0.8745, 0.8588, 0.8118,  ..., 0.8824, 0.9137, 0.8824],
          [0.8510, 0.8588, 0.8353,  ..., 0.8902, 0.9059, 0.8824],
          ...,
          [0.4745, 0.4824, 0.5216,  ..., 0.8745, 0.8745, 0.8824],
          [0.5686, 0.4824, 0.5529,  ..., 0.8745, 0.8745, 0.8824],
          [0.6784, 0.5137, 0.6000,  ..., 0.8745, 0.8745, 0.8824]],
 
         [[0.8980, 0.9059, 0.8980,  ..., 0.8588, 0.8980, 0.8667],
          [0.9059, 0.8902, 0.8431,  ..., 0.8667, 0.8980, 0.8667],
          [0.8588, 0.8667, 0.8431,  ..., 0.8745, 0.8902, 0.8667],
          ...,
          [0.2549, 0.2784, 0.3490,  ..., 0.8745, 0.8745, 0.8824],
          [0.3412, 0.2784, 0.3804,  ..., 0.8745, 0.8745, 0.8824],
          [0.4510, 0.3098, 0.4275,  ..., 0.8745, 0.8745, 0.8824]],
 
         [[0.9059, 0.9137, 0.8902,  ..., 0.8824, 0.9059, 0.8745],
          [0.9137, 0.8980, 0.8510,  ..., 0.8902, 0.9059, 0.8745],
          [0.8902, 0.8980, 0.8745,  ...,

In [9]:
import torch
from torch import nn
channels = 12
model = nn.Sequential(nn.Conv2d(3, channels, kernel_size=(3, 3), padding='same', bias=True),
                          nn.AdaptiveAvgPool2d(1),
                          nn.Flatten(1),
                          nn.Linear(channels,1),
                          nn.Flatten(0))



x = torch.rand((5,3,16,16))
model(x).size()

torch.Size([5])

In [42]:
import os
import yaml

f = "./encoder_configs/base.yaml"
os.path.exists(f)
with open(f, 'r') as file:
    config = yaml.safe_load(file)

print(config["train_settings"]["checkpoint_path"])


None


In [None]:
DF1 = ['case_id', 'slide_id', 'site', 'traintest', 'metas','gendata']
DF2 = ['TILEPATH','SLIDE_ID']
#get one tensor for gendata, one df with tilepath, metadata, tensoridx

In [60]:
mode = "vali"
assert mode in ["train","test","val"], "Dataset mode not known"
df_train.survival_months_discretized[df_train.survival_months_discretized==( 0 if mode=="train" else 1 if mode=="test" else 2)]

AssertionError: Dataset mode not known
