# EDA
In this notebook we describe the general pipeline and perform Exploratory Data Analysis (EDA) on the patches of histological images generated in the previous steps.<br>
<p style="text-align:center;">
<img title="Patches generation workflow"
     alt="Patches generation workflow"
     src="./ims/data-generation-pipeline.png"
     width="800"
     height="500">
</p>

Colorectal cancer (CRC) is the third most common cancer in the world. It is a heterogeneous disease, meaning that it does not present the same way in all patients. [Consensus Molecular Subtypes (CMSs)](https://www.nature.com/articles/nm.3967) were developed to help clinicians choose better treatment for patients. CMSs are based on the gene expression profiles of the tumor, which are expensive to obtain. Histological images, in contrast, are much cheaper and easier to obtain. A deep-learning-based [approach](https://gut.bmj.com/content/70/3/544.long) was developed to infer CMSs from histological images alone. This computer vision approach looked very promising despite relying on the somewhat dated [Inception V3](https://arxiv.org/abs/1512.00567) architecture. The hypothesis of this project is that state-of-the-art architectures will perform better than Inception V3, and the goal of the project is to test this hypothesis.

This notebook starts at the point where we already have the patches of histological images, with the corresponding labels encoded in their filenames.

In [1]:
import pandas as pd

# import from file in the parent directory
import sys
sys.path.append('../')
from params import PATH_PATCHES, WANDB_PROJECT, ENTITY, RAW_DATA_AT
assert PATH_PATCHES.exists()

from fastai.vision.all import *
import wandb

In [2]:
df = pd.read_csv('../benchmarking.csv')
print(df.shape)
df.head()

(766, 3)


Unnamed: 0,model,F1,error
0,vit_large_r50_s32_224.augreg_in21k,0.836191,
1,resnet14t.c3_in1k,0.780034,
2,resnetv2_50x3_bit.goog_in21k_ft_in1k,0.73962,
3,resnetv2_50x3_bit.goog_in21k,0.745324,
4,vit_large_r50_s32_224.augreg_in21k_ft_in1k,0.781215,


In [3]:
# drop rows with missing values in F1 column
df = df.dropna(subset=['F1'])
df.shape

(763, 3)

In [4]:
# drop `error` column
df = df.drop(columns=['error'])
df.head()

Unnamed: 0,model,F1
0,vit_large_r50_s32_224.augreg_in21k,0.836191
1,resnet14t.c3_in1k,0.780034
2,resnetv2_50x3_bit.goog_in21k_ft_in1k,0.73962
3,resnetv2_50x3_bit.goog_in21k,0.745324
4,vit_large_r50_s32_224.augreg_in21k_ft_in1k,0.781215


In [5]:
# sort by F1
df = df.sort_values(by='F1', ascending=False)
df.head()

Unnamed: 0,model,F1
436,res2net101d.in1k,0.888734
10,rexnetr_300.sw_in12k_ft_in1k,0.857955
169,seresnextaa101d_32x8d.sw_in12k_ft_in1k_288,0.85483
179,coatnet_0_rw_224.sw_in1k,0.853199
0,vit_large_r50_s32_224.augreg_in21k,0.836191


In [6]:
# get rows that have the "inception" in the model name
df[df['model'].str.contains('inception')]

Unnamed: 0,model,F1
614,inception_v3.tf_in1k,0.546135
709,inception_resnet_v2.tf_ens_adv_in1k,0.541105
751,inception_v3.tv_in1k,0.514935
733,inception_v3.tf_adv_in1k,0.499923
565,inception_v3.gluon_in1k,0.479869
564,inception_resnet_v2.tf_in1k,0.462982
628,inception_v4.tf_in1k,0.437398


In [7]:
df[df['model'].str.contains('inception')]['model'].head(2).values

array(['inception_v3.tf_in1k', 'inception_resnet_v2.tf_ens_adv_in1k'],
      dtype=object)

In [8]:
# select names of top 8 models by F1 score + top 2 inception models
model_names = list(df['model'].head(8).values) + list(df[df['model'].str.contains('inception')]['model'].head(2).values)
model_names

['res2net101d.in1k',
 'rexnetr_300.sw_in12k_ft_in1k',
 'seresnextaa101d_32x8d.sw_in12k_ft_in1k_288',
 'coatnet_0_rw_224.sw_in1k',
 'vit_large_r50_s32_224.augreg_in21k',
 'resnext101_32x4d.fb_swsl_ig1b_ft_in1k',
 'vit_base_r50_s16_224.orig_in21k',
 'coatnet_rmlp_1_rw2_224.sw_in12k',
 'inception_v3.tf_in1k',
 'inception_resnet_v2.tf_ens_adv_in1k']
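The selection in the cells above (drop failed runs, rank by F1, take the top models, then append the best Inception-family baselines) can be wrapped in one helper. The following is only a minimal sketch: `select_models` and its parameters are hypothetical names, not part of the project code, and the toy DataFrame reuses a few rows from the benchmark output plus one made-up failed run.

```python
import pandas as pd


def select_models(df, n_top=8, baseline="inception", n_baseline=2):
    """Return the top-n model names by F1 plus the best baseline-family models.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # Drop runs without an F1 score and rank the rest, best first
    ranked = df.dropna(subset=["F1"]).sort_values("F1", ascending=False)
    top = ranked["model"].head(n_top).tolist()
    # regex=False treats the pattern as a plain substring
    is_baseline = ranked["model"].str.contains(baseline, regex=False)
    baselines = ranked.loc[is_baseline, "model"].head(n_baseline).tolist()
    # dict.fromkeys removes duplicates while preserving order,
    # in case a baseline model is already in the top-n list
    return list(dict.fromkeys(top + baselines))


# Toy data: four rows copied from the benchmark table, one invented failed run
toy = pd.DataFrame({
    "model": [
        "res2net101d.in1k",
        "inception_v3.tf_in1k",
        "failed_model",  # hypothetical entry with a missing score
        "inception_v4.tf_in1k",
        "coatnet_0_rw_224.sw_in1k",
    ],
    "F1": [0.888734, 0.546135, None, 0.437398, 0.853199],
})

selected = select_models(toy, n_top=2, n_baseline=1)
print(selected)
```

Deduplicating the combined list is a precaution rather than something the benchmark required: with the scores shown above, no Inception model is close to the top 8.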

In [18]:
# save model names to file
with open('../models_to_benchmark.txt', 'w') as f:
    for item in model_names:
        f.write(f"{item}\n")
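A later benchmarking script would presumably read this file back. As a hedged sketch of that round trip, the snippet below writes a short list of names to a temporary directory (so it does not touch `../models_to_benchmark.txt`) and loads it again, one name per line. The two-item list is just a sample taken from the selection above.

```python
from pathlib import Path
import tempfile

# Sample of the selected names; the real list has ten entries
model_names = ["res2net101d.in1k", "inception_v3.tf_in1k"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "models_to_benchmark.txt"
    # One model name per line, with a trailing newline
    path.write_text("\n".join(model_names) + "\n")
    # Read the names back, ignoring blank lines and surrounding whitespace
    loaded = [line.strip() for line in path.read_text().splitlines() if line.strip()]

print(loaded)
```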

In [10]:
def get_label(fname):
    """Extract the label from a patch file name (third-from-last `_`-separated field)."""
    return fname.split('_')[-3]


def _create_table(image_files):
    """Create a table with images and corresponding labels."""
    table = wandb.Table(columns=['Image', 'Label', 'Fname', 'Split'])

    for image_file in progress_bar(image_files, total=len(image_files)):
        image = Image.open(image_file)
        label = get_label(image_file.name)

        table.add_data(wandb.Image(image),
                       label,
                       str(image_file.name),
                       "None") # we don't have a split column yet

    return table
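To make the label parsing concrete, here is a small sketch that applies `get_label` to a few file names and counts the labels. The file names are hypothetical: the actual naming scheme of the patches is not shown in this notebook, and the only assumption taken from the code above is that the label is the third-from-last field when splitting on `_`.

```python
from collections import Counter


def get_label(fname):
    """Same parsing rule as above: third-from-last `_`-separated field."""
    return fname.split('_')[-3]


# Hypothetical file names; the real patches may be named differently
fnames = [
    "TCGA-AA-0001_CMS2_3_7.png",
    "TCGA-AA-0001_CMS2_4_7.png",
    "TCGA-BB-0002_CMS4_0_1.png",
]

labels = [get_label(f) for f in fnames]
counts = Counter(labels)
print(labels)
print(counts)
```

A count like this is a quick way to check class balance across patches before uploading them.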

We will start a new W&B `run` and put everything into a raw Artifact.

In [11]:
run = wandb.init(project=WANDB_PROJECT, entity=ENTITY, job_type='upload')
raw_data_at = wandb.Artifact(RAW_DATA_AT, type='raw_data')

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mapopov[0m ([33mijc-amp[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [12]:
raw_data_at.add_dir(PATH_PATCHES, name='patches')

[34m[1mwandb[0m: Adding directory to artifact (/mnt/data/ijc-histology-data/TCGA-COAD-patches-5-percent)... Done. 0.5s


Let's get the file names of the images in our dataset and use the function defined above to create a W&B Table.

In [13]:
image_files = get_image_files(PATH_PATCHES, recurse=False)

In [14]:
table = _create_table(image_files)

Finally, we will add the Table to our Artifact, log it to W&B and finish our `run`.

In [15]:
raw_data_at.add(table, "labels_table")

ArtifactManifestEntry(path='labels_table.table.json', digest='/x+PtP9C+xW2hIvySl7wOA==', size=493269, local_path='/home/anton/.local/share/wandb/artifacts/staging/tmpc4unq51p')

In [16]:
run.log_artifact(raw_data_at)
run.finish()