# **Spatial Transcriptomics Deep Learning (STDL) Project Notebook**

> The notebook contains main experiments and examples of how to use the code

## **Phase 1: Pre-processing and technical preparations**

### 1.1: **Assign GPU device and allow CUDA debugging**

In [1]:
# the next 2 lines are to allow debugging with CUDA !
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"  
print(f'cuda debugging allowed')

cuda debugging allowed


In [2]:
%%time

import torch
print(f'cuda device count: {torch.cuda.device_count()}')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
#Additional Info when using cuda
if device.type == 'cuda':
    print(f'device name: {torch.cuda.get_device_name(0)}')
    print(f'torch.cuda.device(0): {torch.cuda.device(0)}')
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_cached(0)/1024**3,1), 'GB')
# NOTE: important !!!!!!
# clearing out the cache before beginning
torch.cuda.empty_cache()

cuda device count: 1
Using device: cuda
device name: GeForce GTX 1080 Ti
torch.cuda.device(0): <torch.cuda.device object at 0x7f0fef4f6b50>
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB
CPU times: user 1.1 s, sys: 2.8 s, total: 3.9 s
Wall time: 5.98 s


### 1.2: **Import some of the project modules (more will be loaded later)**

> `loadAndPreProcess` module contains methods to load the data files as pytorch and pandas objects, methods to preprocess the given data, and methods to create custom datasets from the preprocessed data.

> `deepNetworkArchitechture` module contains .... write this :sad:  :\

<div class="alert alert-block alert-warning">
<b>TODO:</b> fill above line
</div>

In [4]:
# create code to reimport module if i change it
%load_ext autoreload
%autoreload 2

# note: path to project is: /home/roy.rubin/STDLproject/
import loadAndPreProcess
import deepNetworkArchitechture

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### 1.3: **Load pytorch dataset objects from the image folder**

> Note that `augmentedImageFolder` is a custom dataset of imageFolder objects with different transformations (see code).

In [5]:
%%time

path_to_images_dir_patient1_train = "/home/roy.rubin/STDLproject/spatialGeneExpressionData/patient1/images"
imageFolder_train = loadAndPreProcess.load_dataset_from_images_folder(path_to_images_dir_patient1_train)
augmentedImageFolder_train = loadAndPreProcess.load_augmented_imageFolder_DS_from_images_folder(path_to_images_dir_patient1_train)


----- entered function load_dataset_from_pictures_folder -----

----- finished function load_dataset_from_pictures_folder -----


----- entered function load_dataset_from_pictures_folder -----

----- finished function load_dataset_from_pictures_folder -----

CPU times: user 203 ms, sys: 130 ms, total: 333 ms
Wall time: 1.3 s


In [6]:
%%time

path_to_images_dir_patient2_test = "/home/roy.rubin/STDLproject/spatialGeneExpressionData/patient2/images"
imageFolder_test = loadAndPreProcess.load_dataset_from_images_folder(path_to_images_dir_patient2_test)
# augmentedImageFolder_test = loadAndPreProcess.load_augmented_imageFolder_DS_from_images_folder(path_to_images_dir_patient2_test) # not needed for now


----- entered function load_dataset_from_pictures_folder -----

----- finished function load_dataset_from_pictures_folder -----

CPU times: user 49.2 ms, sys: 93.2 ms, total: 142 ms
Wall time: 767 ms


### 1.4: **Load pandas dataframe objects from the 3 given tsv/csv files**

> `matrix_dataframe` represents the gene expression count values of each sample for each gene

> `features_dataframe` contains the names of all the genes

> `barcodes_datafame` contains the names of all the samples

In [7]:
%%time

path_to_mtx_tsv_files_dir_patient1_train = "/home/roy.rubin/STDLproject/spatialGeneExpressionData/patient1"
matrix_dataframe_train, features_dataframe_train , barcodes_datafame_train = loadAndPreProcess.load_dataframes_from_mtx_and_tsv_new(path_to_mtx_tsv_files_dir_patient1_train)


----- entered function load_dataframes_from_mtx_and_tsv -----
started reading features.tsv
V  finished reading features.tsv
started reading barcodes.tsv
V  finished reading barcodes.tsv
started reading matrix.mtx. this might take some time ...
V  finished reading matrix.mtx

----- finished function load_dataframes_from_mtx_and_tsv -----

CPU times: user 1min 23s, sys: 931 ms, total: 1min 24s
Wall time: 1min 45s


In [8]:
%%time

path_to_mtx_tsv_files_dir_patient2_test = "/home/roy.rubin/STDLproject/spatialGeneExpressionData/patient2"
matrix_dataframe_test, features_dataframe_test , barcodes_datafame_test = loadAndPreProcess.load_dataframes_from_mtx_and_tsv_new(path_to_mtx_tsv_files_dir_patient2_test)


----- entered function load_dataframes_from_mtx_and_tsv -----
started reading features.tsv
V  finished reading features.tsv
started reading barcodes.tsv
V  finished reading barcodes.tsv
started reading matrix.mtx. this might take some time ...


ValueError: not enough values to unpack (expected 2, got 1)

### 1.5: **Remove less-informative genes**

> we define *less-informative* genes as genes with less than K counts over all samples

> K is a parameter for the user's choice

In [9]:
%%time

Base_value = 10
matrix_dataframe_train, mapping_between_old_and_new_indices_train = loadAndPreProcess.cut_genes_with_under_B_counts(matrix_dataframe_train, Base_value) 
# # uncomment if wanted:
# print(f'\nnote: this is the mapping_between_old_and_new_indices: \n{mapping_between_old_and_new_indices}')

cutting all genes (rows) that contain only zeros ...

print data regarding the reduced dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18484 entries, 0 to 18483
Columns: 4992 entries, 0 to 4991
dtypes: Sparse[int64, 0](4992)
memory usage: 245.0 MB
None
   0     1     2     3     4     5     6     7     8     9     ...  4982  \
0     0     0     0     0     0     0     0     0     0     1  ...     0   
1     0     0     0     2     0     0     0     0     0     0  ...     0   
2     0     0     0     0     0     0     0     0     0     0  ...     0   
3     0     0     0     0     0     0     0     0     0     0  ...     0   
4     0     0     0     1     0     0     0     0     0     0  ...     0   

   4983  4984  4985  4986  4987  4988  4989  4990  4991  
0     0     0     0     0     0     0     0     0     0  
1     0     0     0     0     0     0     0     0     0  
2     0     0     0     0     0     0     0     0     0  
3     0     0     0     0     0     0     0  

In [10]:
%%time

Base_value = 0
matrix_dataframe_test, mapping_between_old_and_new_indices_test = loadAndPreProcess.cut_genes_with_under_B_counts(matrix_dataframe_train, Base_value) 
#TODO: note that this actually is not really needed, but it is to not change the functions inside this class to adjust to the test data

# # uncomment if wanted:
# print(f'\nnote: this is the mapping_between_old_and_new_indices: \n{mapping_between_old_and_new_indices}')

cutting all genes (rows) that contain only zeros ...

print data regarding the reduced dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18484 entries, 0 to 18483
Columns: 4992 entries, 0 to 4991
dtypes: Sparse[int64, 0](4992)
memory usage: 245.0 MB
None
   0     1     2     3     4     5     6     7     8     9     ...  4982  \
0     0     0     0     0     0     0     0     0     0     1  ...     0   
1     0     0     0     2     0     0     0     0     0     0  ...     0   
2     0     0     0     0     0     0     0     0     0     0  ...     0   
3     0     0     0     0     0     0     0     0     0     0  ...     0   
4     0     0     0     1     0     0     0     0     0     0  ...     0   

   4983  4984  4985  4986  4987  4988  4989  4990  4991  
0     0     0     0     0     0     0     0     0     0  
1     0     0     0     0     0     0     0     0     0  
2     0     0     0     0     0     0     0     0     0  
3     0     0     0     0     0     0     0  

### 1.6: **Normalize matrix_dataframe entries**

> normaliztion will be performed on the remainning rows of the dataframe with the logic "log 1P"

> This method Calculates log(1 + x)

In [11]:
%%time

matrix_dataframe_train = loadAndPreProcess.perform_log_1p_normalization(matrix_dataframe_train) 

performing log1P transformation of the dataframe ...

CPU times: user 2.32 s, sys: 73.7 ms, total: 2.4 s
Wall time: 3.17 s


In [12]:
%%time

matrix_dataframe_test = loadAndPreProcess.perform_log_1p_normalization(matrix_dataframe_test) 

performing log1P transformation of the dataframe ...

CPU times: user 2.01 s, sys: 52.4 ms, total: 2.07 s
Wall time: 2.64 s


### 1.7: **Create custom datasets**

> Each custom dataset is tailored per task

> there are four tasks: single gene prediction, k gene prediction, all gene prediction using NMF dim. reduction, all gene prediction using AE dim. reduction

> For each of the above tasks 2 datasets were created - one with the regular images, and one with the augmented dataset - images with transformations.


In [13]:
%%time
gene_name = 'MKI67'
custom_DS_SingleValuePerImg = loadAndPreProcess.STDL_Dataset_SingleValuePerImg(imageFolder=imageFolder_train, 
                                                               matrix_dataframe=matrix_dataframe_train, 
                                                               features_dataframe=features_dataframe_train, 
                                                               barcodes_datafame=barcodes_datafame, 
                                                               chosen_gene_name=gene_name,
                                                               index_mapping=mapping_between_old_and_new_indices_train)
custom_DS_SingleValuePerImg_augmented = loadAndPreProcess.STDL_Dataset_SingleValuePerImg(imageFolder=augmentedImageFolder_train, 
                                                               matrix_dataframe=matrix_dataframe_train, 
                                                               features_dataframe=features_dataframe_train, 
                                                               barcodes_datafame=barcodes_datafame, 
                                                               chosen_gene_name=gene_name,
                                                               index_mapping=mapping_between_old_and_new_indices_train)
custom_DS_SingleValuePerImg_test = loadAndPreProcess.STDL_Dataset_SingleValuePerImg(imageFolder=imageFolder_test, 
                                                               matrix_dataframe=matrix_dataframe_test, 
                                                               features_dataframe=features_dataframe_test, 
                                                               barcodes_datafame=barcodes_datafame_test, 
                                                               chosen_gene_name=gene_name,
                                                               index_mapping=mapping_between_old_and_new_indices_test)

NameError: name 'barcodes_datafame' is not defined

<div class="alert alert-block alert-info">
<b>Note:</b> inside the init phase of `STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance` class, K genes with the highest variance are chosen from matrix_dataframe, and they are the only genes that are kept for training and testing purposes
</div>

In [14]:
%%time

k = 10
custom_DS_KGenesWithHighestVariance = loadAndPreProcess.STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance(imageFolder=imageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_datafame=barcodes_datafame_train, 
                                                                           num_of_dims_k=k)
custom_DS_KGenesWithHighestVariance_augmented = loadAndPreProcess.STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance(imageFolder=augmentedImageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_datafame=barcodes_datafame_train, 
                                                                           num_of_dims_k=k)
custom_DS_KGenesWithHighestVariance_test = loadAndPreProcess.STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance(imageFolder=augmentedImageFolder_test, 
                                                                           matrix_dataframe=matrix_dataframe_test, 
                                                                           features_dataframe=features_dataframe_test, 
                                                                           barcodes_datafame=barcodes_datafame_test, 
                                                                           num_of_dims_k=k)


----- entering __init__ phase of  STDL_Dataset_KValuesPerImg_KGenesWithHighestVariance -----
calculate variance of all columns from  matrix_dataframe - and choosing K genes with higest variance ...


UnboundLocalError: local variable 'reduced_df' referenced before assignment

<div class="alert alert-block alert-info">
<b>Note:</b> inside the init phase of `STDL_Dataset_KValuesPerImg_LatentTensor_NMF` class, an NMF decompositionis performed on the matrix_dataframe object
</div>

In [15]:
%%time

k = 10
custom_DS_LatentTensor_NMF = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_NMF(imageFolder=imageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_datafame=barcodes_datafame_train, 
                                                                           num_of_dims_k=k)
custom_DS_LatentTensor_NMF_augmented = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_NMF(imageFolder=augmentedImageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_datafame=barcodes_datafame_train, 
                                                                           num_of_dims_k=k)
custom_DS_LatentTensor_NMF_augmented = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_NMF(imageFolder=augmentedImageFolder_test, 
                                                                           matrix_dataframe=matrix_dataframe_test, 
                                                                           features_dataframe=features_dataframe_test, 
                                                                           barcodes_datafame=barcodes_datafame_test, 
                                                                           num_of_dims_k=k)


----- entering __init__ phase of  STDL_Dataset_KValuesPerImg_LatentTensor_NMF -----
performing NMF decomposition on main matrix dataframe ...

----- finished __init__ phase of  STDL_Dataset_LatentTensor -----


----- entering __init__ phase of  STDL_Dataset_KValuesPerImg_LatentTensor_NMF -----
performing NMF decomposition on main matrix dataframe ...

----- finished __init__ phase of  STDL_Dataset_LatentTensor -----



NameError: name 'augmentedImageFolder_test' is not defined

<div class="alert alert-block alert-info">
<b>Note:</b> inside the init phase of `custom_DS_LatentTensor_AE` class, an Autoencoder network is being trained.
</div>

In [16]:
%%time

k = 10
custom_DS_LatentTensor_AE = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_AutoEncoder(imageFolder=imageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_datafame=barcodes_datafam_traine, 
                                                                           num_of_dims_k=k,
                                                                           device=device)
custom_DS_LatentTensor_AE_augmented = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_AutoEncoder(imageFolder=augmentedImageFolder_train, 
                                                                           matrix_dataframe=matrix_dataframe_train, 
                                                                           features_dataframe=features_dataframe_train, 
                                                                           barcodes_datafame=barcodes_datafame_train, 
                                                                           num_of_dims_k=k,
                                                                           device=device)
custom_DS_LatentTensor_AE_augmented = loadAndPreProcess.STDL_Dataset_KValuesPerImg_LatentTensor_AutoEncoder(imageFolder=augmentedImageFolder_test, 
                                                                           matrix_dataframe=matrix_dataframe_test, 
                                                                           features_dataframe=features_dataframe_test, 
                                                                           barcodes_datafame=barcodes_datafame_test, 
                                                                           num_of_dims_k=k,
                                                                           device=device)

NameError: name 'barcodes_datafam_traine' is not defined

### 1.8: prepare for the next phases in which the experiments are executed

> import `executionModule` which contains the experiments, training methods, and testing methods

> create `hyperparameters` dictionary which will contain all of the hyper-parameters for our experiments (note - user can change these later)

> create `experiment_model_list` that will hold all the experiment methods in which the stated model is used for the experiment's execution

> model_list

> assisting function

>

<div class="alert alert-block alert-warning">
<b>Warning:</b> change the hyper-parameters below with caution if needed !
</div>

In [17]:
%%time

import executionModule

# define hyperparameters for the experiments
hyperparameters = dict()
hyperparameters['batch_size'] = 25
hyperparameters['max_alowed_number_of_batches'] = 99999
hyperparameters['precent_of_dataset_allocated_for_training'] = 0.8
hyperparameters['learning_rate'] = 1e-4
hyperparameters['momentum'] = 0.9
hyperparameters['num_of_epochs'] = 3
hyperparameters['channels'] = [32] 
hyperparameters['num_of_convolution_layers'] = len(hyperparameters['channels'])
hyperparameters['hidden_dims'] = [100]
hyperparameters['num_of_hidden_layers'] = len(hyperparameters['hidden_dims'])
hyperparameters['pool_every'] = 99999

# TODO: not needed anymore ?
experiment_model_list = []
experiment_model_list.append(executionModule.runExperimentWithModel_BasicConvNet)
experiment_model_list.append(executionModule.runExperimentWithModel_DenseNet121)
# experiment_model_list.append(executionModule.runExperimentWithModel_InceptionV3) #TODO: still need to sort out input image size... !

model_list = []
model_list.append('BasicConvNet')
model_list.append('DensetNet121')
#model_list.append('Inception_V3')  #TODO: still need to sort out input image size... !



AttributeError: module 'executionModule' has no attribute 'runExperimentWithModel_BasicConvNet'

<div class="alert alert-block alert-danger">
<b>Note:</b> inception_v3 model isnt sorted out yet... gives
    RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (5 x 5). Kernel size can't be greater than actual input size
</div>

In [18]:
def experiment_loop(ds_train,ds_test,phase_name):
    for model_name in experiment_model_list:
    print(f'\nstarting experiment **{model_name}**\n')
    %time executionModule.runExperiment(ds_train=custom_DS_SingleValuePerImg,
                                        ds_test=, 
                                        hyperparams=hyperparameters,
                                        device=device, 
                                        model_name=model_name, 
                                        dataset_name=phase_name)
    print(f'\nfinished experiment {model_name}')
    

IndentationError: expected an indented block (<ipython-input-18-19b04bd073ae>, line 3)

## Phase 2: Single Gene Prediction

In [None]:
experiment_loop(ds_train=custom_DS_SingleValuePerImg ,ds_test= ,phase_name='single_gene')

In [None]:
experiment_loop(ds_train=custom_DS_SingleValuePerImg_augmented ,ds_test= ,phase_name='single_gene')

## Phase 3: K genes prediction

In [None]:
experiment_loop(ds_train=custom_DS_KGenesWithHighestVariance ,ds_test= ,phase_name='k_genes')

In [None]:
experiment_loop(ds_train=custom_DS_KGenesWithHighestVariance_augmented ,ds_test= ,phase_name='k_genes')

## Phase 4: All genes prediction - using dimensionality reduction techniques

### 4.1: Prediction using dimensionality reduction technique NMF

In [None]:
experiment_loop(ds_train=custom_DS_LatentTensor_NMF ,ds_test= ,phase_name='NMF')

In [None]:
experiment_loop(ds_train=custom_DS_LatentTensor_NMF_augmented ,ds_test= ,phase_name='NMF')

### 4.1: Prediction using dimensionality reduction technique AE

In [None]:
experiment_loop(ds_train=custom_DS_LatentTensor_AE ,ds_test= ,phase_name='AE')

In [None]:
experiment_loop(ds_train=custom_DS_LatentTensor_AE_augmented ,ds_test= ,phase_name='AE')

<div class="alert alert-block alert-danger">
<b>Note:</b> not tested yet
</div>