# **Spatial Transcriptomics Deep Learning (STDL) Project Notebook**

> This notebook contains the main experiments and examples of how to use the project code.

## **Phase 1: Pre-processing and technical preparations**

### 1.1: **Assign GPU device and allow CUDA debugging**

In [1]:
# enable automatic re-import of modules whenever their source code changes
%load_ext autoreload

In [2]:
# the next 2 lines make CUDA kernel launches synchronous, so errors are reported at the call that caused them
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
print('cuda debugging allowed')

cuda debugging allowed


In [3]:
%%time

import torch
print(f'cuda device count: {torch.cuda.device_count()}')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
# Additional info when using CUDA
if device.type == 'cuda':
    print(f'device name: {torch.cuda.get_device_name(0)}')
    print(f'torch.cuda.device(0): {torch.cuda.device(0)}')
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    # memory_reserved replaces the deprecated memory_cached
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')
# NOTE: important: clear the CUDA cache before beginning
torch.cuda.empty_cache()

cuda device count: 1
Using device: cuda
device name: GeForce RTX 2080 Ti
torch.cuda.device(0): <torch.cuda.device object at 0x7f31302e4a10>
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB
CPU times: user 1.88 s, sys: 6.16 s, total: 8.04 s
Wall time: 5.02 s


### 1.2: **Import the Pre-Process module**

> The `loadAndPreProcess` module contains methods to load the data files as PyTorch and pandas objects, methods to preprocess the given data, and methods to create custom datasets from the preprocessed data.

<div class="alert alert-block alert-warning">
<b>TODO:</b> fill above line
</div>

In [4]:
# note: path to project is: /home/roy.rubin/STDLproject/
%autoreload 2
import loadAndPreProcess

### 1.3: **Load pytorch dataset objects from the image folder**

> Load the regular and augmented datasets created from the given image folder, with transformations applied.

> Note: `augmentedImageFolder` is a custom dataset of imageFolder objects with different transformations (see code).

> Note: `im_hight_and_width_size` defines the size to which the images in the folder are resized. Their original size is 176, so a larger value makes `resize` upsample them automatically (the interpolation method has not been checked), which means the images might look "pixelized" or lower in quality. The problem is that size 176 does not work with all models, so the size had to be increased for those models.

In [5]:
im_hight_and_width_size = 176  # values: 176 (doesn't work with inception) / 224 (doesn't work with inception) / 299 (works with inception)

In [6]:
%%time

path_to_images_dir_patient1_train = "/home/roy.rubin/STDLproject/spatialGeneExpressionData/patient1/images"
imageFolder_train = loadAndPreProcess.load_dataset_from_images_folder(path_to_images_dir_patient1_train, im_hight_and_width_size)
augmentedImageFolder_train = loadAndPreProcess.load_augmented_imageFolder_DS_from_images_folder(path_to_images_dir_patient1_train, im_hight_and_width_size)


----- entered function load_dataset_from_pictures_folder -----

----- finished function load_dataset_from_pictures_folder -----


----- entered function load_dataset_from_pictures_folder -----

----- finished function load_dataset_from_pictures_folder -----

CPU times: user 263 ms, sys: 27.7 ms, total: 291 ms
Wall time: 512 ms


In [7]:
%%time

path_to_images_dir_patient2_test = "/home/roy.rubin/STDLproject/spatialGeneExpressionData/patient2/images"
imageFolder_test = loadAndPreProcess.load_dataset_from_images_folder(path_to_images_dir_patient2_test, im_hight_and_width_size)
# augmentedImageFolder_test = loadAndPreProcess.load_augmented_imageFolder_DS_from_images_folder(path_to_images_dir_patient2_test, im_hight_and_width_size) # not needed for now


----- entered function load_dataset_from_pictures_folder -----

----- finished function load_dataset_from_pictures_folder -----

CPU times: user 21.6 ms, sys: 6.21 ms, total: 27.8 ms
Wall time: 58.9 ms


### 1.4: **Load pandas dataframe objects from the given mtx/tsv/csv files**

> `matrix_dataframe` represents the gene expression count values of each sample for each gene

> `features_dataframe` contains the names of all the genes

> `barcodes_dataframe` contains the names of all the samples

In [None]:
%%time

path_to_mtx_tsv_files_dir_patient1_train = "/home/roy.rubin/STDLproject/spatialGeneExpressionData/patient1"
matrix_dataframe_train, features_dataframe_train , barcodes_dataframe_train = loadAndPreProcess.load_dataframes_from_mtx_and_tsv_new(path_to_mtx_tsv_files_dir_patient1_train)


----- entered function load_dataframes_from_mtx_and_tsv -----
started reading features.tsv
V  finished reading features.tsv
started reading barcodes.tsv
V  finished reading barcodes.tsv
started reading matrix.mtx. this might take some time ...


In [None]:
%%time

path_to_mtx_tsv_files_dir_patient2_test = "/home/roy.rubin/STDLproject/spatialGeneExpressionData/patient2"
matrix_dataframe_test, features_dataframe_test , barcodes_dataframe_test = loadAndPreProcess.load_dataframes_from_mtx_and_tsv_new(path_to_mtx_tsv_files_dir_patient2_test)

### 1.5: **Remove samples from the matrix dataframe with no matching images in the image folder**

> Note: column indices are reset by this step, so a mapping from old to new column indices is returned: `column_mapping`.

> Note: the dataframe is also reordered to match the order of the images in the image folder.
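
The step above can be illustrated with a minimal pure-Python sketch. It is not the code of `cut_samples_with_no_matching_image_and_reorder_df`; the function name `cut_and_reorder`, the toy barcodes, and the dictionary form of the mapping are assumptions made for illustration only.

```python
def cut_and_reorder(barcodes, image_barcodes):
    """Return (kept_columns, column_mapping) for samples that have an image.

    barcodes: sample names in their original matrix column order.
    image_barcodes: sample names in the order the images appear in the folder.
    """
    old_index = {name: i for i, name in enumerate(barcodes)}
    # Follow the image order, skipping images that have no expression column.
    kept_columns = [old_index[name] for name in image_barcodes if name in old_index]
    # Map each surviving old column index to its new position.
    column_mapping = {old: new for new, old in enumerate(kept_columns)}
    return kept_columns, column_mapping


# Hypothetical toy data: "AAAG" has no matching image and is dropped.
barcodes = ["AAAC", "AAAG", "AACT", "AAGT"]   # matrix column order
image_barcodes = ["AACT", "AAAC", "AAGT"]     # image folder order
kept_columns, column_mapping = cut_and_reorder(barcodes, image_barcodes)
print(kept_columns)     # [2, 0, 3]
print(column_mapping)   # {2: 0, 0: 1, 3: 2}
```

Keeping the mapping makes it possible to translate an original barcode index into the column it occupies after the cut.
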

In [None]:
%%time

matrix_dataframe_train, column_mapping_train = loadAndPreProcess.cut_samples_with_no_matching_image_and_reorder_df(matrix_df=matrix_dataframe_train, 
                                                                                                                    image_folder_of_the_df=imageFolder_train, 
                                                                                                                    barcodes_df=barcodes_dataframe_train)

In [None]:
%%time

matrix_dataframe_test, column_mapping_test = loadAndPreProcess.cut_samples_with_no_matching_image_and_reorder_df(matrix_df=matrix_dataframe_test, 
                                                                                                                  image_folder_of_the_df=imageFolder_test, 
                                                                                                                  barcodes_df=barcodes_dataframe_test)

### 1.6: **Remove less-informative genes**

> We define *less-informative* genes as genes with fewer than `Base_value` counts over all samples

> `Base_value` is a threshold chosen by the user

> Note: row indices are reset by this step, so a mapping from old to new row indices is returned: `row_mapping`.
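
A minimal sketch of this kind of filtering is shown here, using nested lists instead of dataframes. It is not the implementation of `cut_genes_with_under_B_counts_from_train_and_test`. In particular, requiring a gene to reach the threshold in both the train and the test matrix is an assumption, and the name `cut_low_count_genes` and the toy counts are hypothetical.

```python
def cut_low_count_genes(train_rows, test_rows, base_value):
    """Keep genes (rows) whose total count reaches base_value in both matrices.

    ASSUMPTION: the threshold is applied to train and test separately,
    and a gene must pass in both so the two matrices keep identical rows.
    """
    kept_rows = [
        i for i, (train, test) in enumerate(zip(train_rows, test_rows))
        if sum(train) >= base_value and sum(test) >= base_value
    ]
    # Map each surviving old row index to its new position.
    row_mapping = {old: new for new, old in enumerate(kept_rows)}
    train_kept = [train_rows[i] for i in kept_rows]
    test_kept = [test_rows[i] for i in kept_rows]
    return train_kept, test_kept, row_mapping


# Hypothetical toy counts: rows are genes, columns are samples.
train = [[5, 7, 0], [1, 0, 2], [9, 9, 9], [4, 8, 1]]
test = [[6, 6], [8, 8], [5, 6], [0, 3]]
train_kept, test_kept, row_mapping = cut_low_count_genes(train, test, base_value=10)
print(row_mapping)  # {0: 0, 2: 1}
```
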

In [None]:
%%time

# begin by asserting that our dataframes have the same genes to begin with using the metadata of features_dataframe
assert features_dataframe_train['gene_names'].equals(features_dataframe_test['gene_names'])

Base_value = 10
matrix_dataframe_train, matrix_dataframe_test, row_mapping = loadAndPreProcess.cut_genes_with_under_B_counts_from_train_and_test(matrix_dataframe_train, matrix_dataframe_test, Base_value) 

### 1.7: **Normalize matrix_dataframe entries**

> Normalization is performed on the remaining rows of the dataframe using the "log 1P" method

> This method calculates log(1 + x) for each entry x
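
As a small illustration of the transform itself (not of `perform_log_1p_normalization`, whose internals are not shown in this notebook), the sketch below applies log(1 + x) element-wise to a toy count matrix. The helper name `log_1p_normalize` and the counts are hypothetical.

```python
import math


def log_1p_normalize(matrix):
    """Apply log(1 + x) to every entry of a nested-list count matrix."""
    return [[math.log1p(value) for value in row] for row in matrix]


counts = [[0, 1, 9], [99, 3, 0]]  # hypothetical raw counts
normalized = log_1p_normalize(counts)
for row in normalized:
    print([round(value, 3) for value in row])
```

A zero count stays exactly zero, while large counts are compressed (99 becomes log(100)), which reduces the influence of a few highly expressed genes.
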

In [None]:
%%time

matrix_dataframe_train = loadAndPreProcess.perform_log_1p_normalization(matrix_dataframe_train) 

In [None]:
%%time

matrix_dataframe_test = loadAndPreProcess.perform_log_1p_normalization(matrix_dataframe_test) 

> We have now performed all of the pre-processing actions on our matrix dataframes. (More pre-processing is still needed for our datasets.)

> Print some information about our dataframes

In [None]:
import projectUtilities
projectUtilities.printInfoAboutReducedDF(matrix_dataframe_train)
print("\n****\n")
projectUtilities.printInfoAboutReducedDF(matrix_dataframe_test)

### 1.8: **Create custom datasets**

> Each custom dataset is tailored per task

> There are four tasks: single-gene prediction, K-gene prediction, all-gene prediction using NMF dimensionality reduction, and all-gene prediction using AE (autoencoder) dimensionality reduction

> For each of the above tasks, 3 datasets were created:

>> A Dataset created from the TRAIN data WITHOUT augmentation (without image transformations)

>> A Dataset created from the TRAIN data WITH augmentation (with image transformations)

>> A Dataset created from the TEST data WITHOUT augmentation (without image transformations)

In [None]:
%%time
gene_name = 'MKI67'
custom_DS_SingleValuePerImg = loadAndPreProcess.STDL_Dataset_SingleValuePerImg(imageFolder=imageFolder_train, 
                                                               matrix_dataframe=matrix_dataframe_train, 
                                                               features_dataframe=features_dataframe_train, 
                                                               barcodes_dataframe=barcodes_dataframe_train, 
                                                               column_mapping=column_mapping_train,
                                                               row_mapping=row_mapping,
                                                               chosen_gene_name=gene_name)
custom_DS_SingleValuePerImg_augmented = loadAndPreProcess.STDL_Dataset_SingleValuePerImg(imageFolder=augmentedImageFolder_train, 
                                                               matrix_dataframe=matrix_dataframe_train, 
                                                               features_dataframe=features_dataframe_train, 
                                                               barcodes_dataframe=barcodes_dataframe_train, 
                                                               column_mapping=column_mapping_train,
                                                               row_mapping=row_mapping,
                                                               chosen_gene_name=gene_name)
custom_DS_SingleValuePerImg_test = loadAndPreProcess.STDL_Dataset_SingleValuePerImg(imageFolder=imageFolder_test, 
                                                               matrix_dataframe=matrix_dataframe_test, 
                                                               features_dataframe=features_dataframe_test, 
                                                               barcodes_dataframe=barcodes_dataframe_test, 
                                                               column_mapping=column_mapping_test,
                                                               row_mapping=row_mapping,
                                                               chosen_gene_name=gene_name)

<div class="alert alert-block alert-info">
<b>Note:</b> inside the init phase of `STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance` class, K genes with the highest variance are chosen from matrix_dataframe, and they are the only genes that are kept for training and testing purposes
</div>
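
The selection described in the note can be sketched as follows. This is not the code inside `STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance`; using the population variance and returning the chosen rows in their original order are both assumptions, and `top_k_variance_rows` is a hypothetical helper.

```python
from statistics import pvariance


def top_k_variance_rows(matrix, k):
    """Return the indices of the k rows (genes) with the highest variance.

    ASSUMPTION: population variance is used, and the selected indices
    are returned in their original row order.
    """
    variances = [pvariance(row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical normalized expression values: rows are genes, columns are samples.
matrix = [
    [1.0, 1.0, 1.0, 1.0],   # variance 0
    [0.0, 4.0, 0.0, 4.0],   # variance 4
    [0.0, 1.0, 0.0, 1.0],   # variance 0.25
    [2.0, 2.0, 3.0, 2.0],   # variance 0.1875
]
print(top_k_variance_rows(matrix, k=2))  # [1, 2]
```
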

In [None]:
%%time

k = 10
custom_DS_KGenesWithHighestVariance = loadAndPreProcess.STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance(imageFolder=imageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_dataframe=barcodes_dataframe_train, 
                                                                           column_mapping=column_mapping_train,
                                                                           num_of_dims_k=k)
custom_DS_KGenesWithHighestVariance_augmented = loadAndPreProcess.STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance(imageFolder=augmentedImageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_dataframe=barcodes_dataframe_train, 
                                                                           column_mapping=column_mapping_train,
                                                                           num_of_dims_k=k)
custom_DS_KGenesWithHighestVariance_test = loadAndPreProcess.STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance(imageFolder=imageFolder_test, 
                                                                           matrix_dataframe=matrix_dataframe_test, 
                                                                           features_dataframe=features_dataframe_test, 
                                                                           barcodes_dataframe=barcodes_dataframe_test, 
                                                                           column_mapping=column_mapping_test,
                                                                           num_of_dims_k=k)

<div class="alert alert-block alert-info">
<b>Note:</b> inside the init phase of `STDL_Dataset_KValuesPerImg_LatentTensor_NMF` class, an NMF decomposition is performed on the matrix_dataframe object
</div>

In [None]:
%%time

k = 10
custom_DS_LatentTensor_NMF = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_NMF(imageFolder=imageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_dataframe=barcodes_dataframe_train, 
                                                                           column_mapping=column_mapping_train,
                                                                           num_of_dims_k=k)
custom_DS_LatentTensor_NMF_augmented = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_NMF(imageFolder=augmentedImageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_dataframe=barcodes_dataframe_train, 
                                                                           column_mapping=column_mapping_train,
                                                                           num_of_dims_k=k)
custom_DS_LatentTensor_NMF_test = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_NMF(imageFolder=imageFolder_test, 
                                                                           matrix_dataframe=matrix_dataframe_test, 
                                                                           features_dataframe=features_dataframe_test, 
                                                                           barcodes_dataframe=barcodes_dataframe_test, 
                                                                           column_mapping=column_mapping_test,
                                                                           num_of_dims_k=k)

<div class="alert alert-block alert-info">
<b>Note:</b> 
<ul>
  <li>First, we create a dataset from `matrix_dataframe_train` to feed our AEnet.</li>
  <li>Then we create our AEnet and train it.</li>
  <li>Finally, we create our `custom_DS_LatentTensor_AE` dataset object, in which the trained Autoencoder network is saved.</li>
</ul>
</div>

In [None]:
dataset_from_matrix_df = loadAndPreProcess.STDL_Dataset_matrix_df_for_AE_init(matrix_dataframe_train)

In [None]:
%autoreload 2

from executionModule import get_Trained_AEnet
k = 10
AEnet = get_Trained_AEnet(dataset_from_matrix_df=dataset_from_matrix_df, z_dim=k, num_of_epochs=3, device=device)

In [None]:
%%time

k = 10
custom_DS_LatentTensor_AE = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_AutoEncoder(imageFolder=imageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_dataframe=barcodes_dataframe_train, 
                                                                           AEnet=AEnet,
                                                                           column_mapping=column_mapping_train,
                                                                           num_of_dims_k=k,
                                                                           device=device)
custom_DS_LatentTensor_AE_augmented = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_AutoEncoder(imageFolder=augmentedImageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_dataframe=barcodes_dataframe_train, 
                                                                           AEnet=AEnet,                                                                                                            
                                                                           column_mapping=column_mapping_train,
                                                                           num_of_dims_k=k,
                                                                           device=device)
custom_DS_LatentTensor_AE_test = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_AutoEncoder(imageFolder=imageFolder_test, 
                                                                           matrix_dataframe=matrix_dataframe_test, 
                                                                           features_dataframe=features_dataframe_test, 
                                                                           barcodes_dataframe=barcodes_dataframe_test, 
                                                                           AEnet=AEnet,                                                                                                       
                                                                           column_mapping=column_mapping_test,
                                                                           num_of_dims_k=k,
                                                                           device=device)

### 1.9: **Prepare for the next phases, in which the experiments are executed**

> import `executionModule` which contains the experiments, training methods, and testing methods

> create `hyperparameters` dictionary which will contain all of the hyper-parameters for our experiments (note - user can change these later)

> create `model_list`, which holds the names of the models that will be used (only 3 models are listed for now, and `Inception_V3` is currently commented out in the cell below). The models are:

>> `BasicConvNet` model

>> `DensetNet121` model

>> `Inception_V3` model

<div class="alert alert-block alert-warning">
<b>Warning:</b> change the hyper-parameters below only if needed, and with caution!
</div>
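
To make the role of a few of these hyper-parameters concrete, the sketch below shows one plausible reading of them. How `executionModule` actually consumes `precent_of_dataset_allocated_for_training`, `batch_size`, and `max_alowed_number_of_batches` is not shown in this notebook, so the interpretation here, the helper `split_sizes`, and the dataset size of 1000 are all assumptions.

```python
import math


def split_sizes(dataset_size, train_fraction, batch_size, max_batches):
    """Return (train_size, validation_size, batches_per_epoch).

    ASSUMPTION: the training fraction splits the train dataset into a
    training part and a validation part, and the batch cap limits how
    many batches are processed per epoch.
    """
    train_size = int(dataset_size * train_fraction)
    validation_size = dataset_size - train_size
    batches_per_epoch = min(math.ceil(train_size / batch_size), max_batches)
    return train_size, validation_size, batches_per_epoch


# Hypothetical dataset of 1000 samples with the values used in this notebook.
print(split_sizes(1000, 0.8, 25, 99999))  # (800, 200, 32)
```

Under this reading, a very large cap such as 99999 effectively disables the batch limit, while a small cap is a quick way to shorten a debugging run.
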

In [None]:
%autoreload 2
import executionModule

# define hyperparameters for the experiments
hyperparameters = dict()
hyperparameters['batch_size'] = 25
hyperparameters['max_alowed_number_of_batches'] = 99999
hyperparameters['precent_of_dataset_allocated_for_training'] = 0.8
hyperparameters['learning_rate'] = 1e-4
hyperparameters['momentum'] = 0.9
hyperparameters['num_of_epochs'] = 3

# define hyperparameters for BasicConvNet
hyperparameters['channels'] = [32] 
hyperparameters['num_of_convolution_layers'] = len(hyperparameters['channels'])
hyperparameters['hidden_dims'] = [100]
hyperparameters['num_of_hidden_layers'] = len(hyperparameters['hidden_dims'])
hyperparameters['pool_every'] = 99999

# list of all models used
model_list = []
model_list.append('BasicConvNet')
model_list.append('DensetNet121')
# model_list.append('Inception_V3')  #TODO: still need to sort out input image size... !

<div class="alert alert-block alert-danger">
<b>Note:</b> the inception_v3 model does not work yet. It raises:
    RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (5 x 5). Kernel size can't be greater than actual input size
</div>

> Create a helper method that runs every model on a given pair of datasets and times each experiment

In [None]:
def experiment_loop(ds_train, ds_test, phase_name):
    for model_name in model_list:
        print(f'\nstarting experiment **{model_name}**\n')
        %time executionModule.runExperiment(ds_train=ds_train, ds_test=ds_test, hyperparams=hyperparameters, device=device, model_name=model_name, dataset_name=phase_name)
        print(f'\nfinished experiment {model_name}')
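
The `%time` magic used in the loop above is only available inside IPython/Jupyter. If the same loop has to run as a plain Python script, one possible substitute is a small timing wrapper like the sketch below. `timed_call` is a hypothetical helper, not part of `executionModule`, and `sum` stands in for the experiment call.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this would be the experiment call.
result, elapsed = timed_call(sum, [1, 2, 3])
print(f"result={result}, wall time: {elapsed:.6f} s")
```
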
    

## Phase 2: Single Gene Prediction

In [None]:
# experiment_loop(ds_train=custom_DS_SingleValuePerImg ,ds_test=custom_DS_SingleValuePerImg_test ,phase_name='single_gene')

In [None]:
# experiment_loop(ds_train=custom_DS_SingleValuePerImg_augmented ,ds_test=custom_DS_SingleValuePerImg_test ,phase_name='single_gene_augmented')

## Phase 3: K genes prediction

In [None]:
# experiment_loop(ds_train=custom_DS_KGenesWithHighestVariance ,ds_test=custom_DS_KGenesWithHighestVariance_test ,phase_name='k_genes')

In [None]:
# experiment_loop(ds_train=custom_DS_KGenesWithHighestVariance_augmented ,ds_test=custom_DS_KGenesWithHighestVariance_test ,phase_name='k_genes_augmented')

## Phase 4: All genes prediction - using dimensionality reduction techniques

### 4.1: Prediction using dimensionality reduction technique NMF

In [None]:
# experiment_loop(ds_train=custom_DS_LatentTensor_NMF ,ds_test=custom_DS_LatentTensor_NMF_test ,phase_name='NMF')

In [None]:
# experiment_loop(ds_train=custom_DS_LatentTensor_NMF_augmented ,ds_test=custom_DS_LatentTensor_NMF_test ,phase_name='NMF_augmented')

### 4.2: Prediction using dimensionality reduction technique AE

In [None]:
# experiment_loop(ds_train=custom_DS_LatentTensor_AE ,ds_test=custom_DS_LatentTensor_AE_test ,phase_name='AE')

In [None]:
experiment_loop(ds_train=custom_DS_LatentTensor_AE_augmented ,ds_test=custom_DS_LatentTensor_AE_test ,phase_name='AE_augmented')

<div class="alert alert-block alert-danger">
<b>Note:</b> everything below this point is a testing block
</div>