## SMOTE

This file is part of the Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project.

Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with the Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project.  If not, see <http://www.gnu.org/licenses/>.


### Objective:
> Split the GTEx dataset into a held-out test set and a training set, then oversample the training set's minority tissue classes with SMOTE

### Input files:
1. *gtex_filtered_tmm_intersect.pkl*


### Output files:
1. *gtex_filtered_tmm_intersect_smote.pkl* 
2. *gtex_filtered_tmm_intersect_test.pkl* 
3. *gtex_filtered_tmm_intersect_imbalanced.pkl* 
 
### Table of contents:
1. [Import Modules](#1.-Import-Modules)  
2. [Set static paths](#2.-Set-static-paths)  
3. [Load files](#3.-Load-files)  
    3.1 [Load RNAseq](#3.1-Load-RNAseq)  
4. [Train/test split](#4.-Train/test-split)  
5. [Generate balanced dataset](#5.-Generate-balanced-dataset)  
    5.1 [Oversampling via SMOTE](#5.1-Oversampling-via-SMOTE)  
6. [Save outputs](#6.-Save-outputs) 

## 1. Import Modules

In [None]:
import pickle
import os
import pandas as pd
import smote_variants as sv

In [None]:
# Specify max number of rows and columns to be displayed in dataframes
pd.options.display.max_rows = 1999
pd.options.display.max_columns = 1999

# Display full output in notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 2. Set static paths

In [None]:
interim_dir = '../data/interim/'
proc_dir = '../data/processed/'

## 3. Load files

#### 3.1 Load RNAseq

In [None]:
%%time
key = 'gtex_filtered_tmm_intersect.pkl'
# Use a context manager so the file handle is closed after loading
with open(os.path.join(interim_dir, key), "rb") as f:
    gtex_tmm = pickle.load(f)

## 4. Train/test split

In [None]:
tissue_type_list = sorted(gtex_tmm['type'].unique().tolist())

In [None]:
%%time

# Construct the test set with exactly 50 samples per class
# (sample() raises ValueError if a class has fewer than 50 samples)
df_test_data = pd.concat(
    [gtex_tmm[gtex_tmm['type'] == tissue_type].sample(n=50, random_state=42)
     for tissue_type in tissue_type_list]
)

# Construct the training set from the remaining samples
df_train_data = gtex_tmm.drop(index=df_test_data.index)
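
The split above relies on the sample index being unique, because the training set is obtained by dropping the test rows by index label. A minimal sketch of a sanity check on toy data follows; the frame `toy`, its column names, and the per-class size `n_test` are hypothetical stand-ins, not values from the GTEx data.

```python
import pandas as pd

# Toy stand-in for gtex_tmm: two feature columns plus a 'type' label column
toy = pd.DataFrame(
    {
        "gene_a": range(10),
        "gene_b": range(10, 20),
        "type": ["liver"] * 6 + ["lung"] * 4,
    },
    index=[f"s{i}" for i in range(10)],
)

n_test = 2  # hypothetical per-class test size (the notebook uses 50)

# Draw n_test samples per class, then keep the rest for training
toy_test = pd.concat(
    [toy[toy["type"] == t].sample(n=n_test, random_state=42)
     for t in sorted(toy["type"].unique())]
)
toy_train = toy.drop(index=toy_test.index)

# The two sets must be disjoint and together cover every sample
assert toy_test.index.intersection(toy_train.index).empty
assert len(toy_test) + len(toy_train) == len(toy)

# Per-class counts in each set
test_counts = toy_test["type"].value_counts().sort_index()
train_counts = toy_train["type"].value_counts().sort_index()
```

The same two assertions could be run on `df_test_data` and `df_train_data` before oversampling, so that no test sample can leak into the SMOTE input.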

In [None]:
gtex_tmm_class_size_test_50 = pd.DataFrame()
gtex_tmm_class_size_test_50['full_sample_size'] = gtex_tmm['type'].value_counts().sort_index()
gtex_tmm_class_size_test_50['test'] = df_test_data['type'].value_counts().sort_index()
gtex_tmm_class_size_test_50['imbalance'] = df_train_data['type'].value_counts().sort_index()
gtex_tmm_class_size_test_50

## 5. Generate balanced dataset

In [None]:
# SMOTE accepts the target as numerical values only, so build a dictionary
# that maps each tissue type (string, in sorted order) to an integer code
label_keys = gtex_tmm['type'].value_counts().sort_index().index.tolist()
num_values = range(len(label_keys))
label_num_dict = dict(zip(label_keys, num_values))
label_num_dict
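
Because the labels are converted back to strings after oversampling, the mapping has to be invertible. The sketch below illustrates the round trip on a handful of hypothetical tissue names; `labels`, `label_to_num`, and `num_to_label` are illustrative names, not objects from the notebook.

```python
# Hypothetical tissue names standing in for gtex_tmm['type'] values
labels = ["Lung", "Liver", "Lung", "Brain", "Liver"]

# Sorted unique names map to consecutive integers, as in the cell above
label_to_num = {name: i for i, name in enumerate(sorted(set(labels)))}
num_to_label = {i: name for name, i in label_to_num.items()}

# Encode to integers, then decode back to the original strings
encoded = [label_to_num[name] for name in labels]
decoded = [num_to_label[i] for i in encoded]
```

Since every tissue name gets a distinct integer, inverting the dictionary loses nothing and `decoded` matches `labels` exactly.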

#### 5.1 Oversampling via SMOTE

In [None]:
# Features are all columns except the last; the last column ('type') holds the labels
X = df_train_data.iloc[:,:-1]
y = df_train_data.iloc[:,-1].map(label_num_dict)

In [None]:
%%time

# Select the SMOTE variant and wrap it for multiclass oversampling
oversampler = sv.MulticlassOversampling(sv.SMOTE(n_jobs=2, random_state=42))

# X_samp and y_samp contain the oversampled dataset
X_samp, y_samp = oversampler.sample(X, y)
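
For intuition, the core idea of SMOTE is to create a synthetic sample on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The NumPy sketch below illustrates that interpolation step on toy data. It is not the `smote_variants` implementation; the function `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=42):
    """Generate n_new synthetic rows by interpolating between minority samples.

    Illustrative only: picks a random minority sample, one of its k nearest
    minority neighbours, and a random point on the segment between them.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with four samples and two features
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
synthetic = smote_like_samples(X_minority, n_new=6)
```

Every synthetic row is a convex combination of two real samples, so it always falls within the range already spanned by the minority class in each feature.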

In [None]:
%%time

# Rebuild the oversampled dataset from a numpy array into a dataframe
df_smote = pd.DataFrame(data=X_samp, columns=df_train_data.columns[:-1])
df_smote = df_smote.rename(index=lambda s: 'sample_' + str(s))

# Reinsert labels as the last column, converted back to tissue-type strings
label_num_dict_inv = {v: k for k, v in label_num_dict.items()}
df_smote['type'] = [label_num_dict_inv[num] for num in y_samp]

In [None]:
gtex_tmm_class_size_test_50['smote'] = df_smote['type'].value_counts().sort_index()
gtex_tmm_class_size_test_50

## 6. Save outputs

In [None]:
key = 'gtex_filtered_tmm_intersect_smote.pkl'
with open(os.path.join(proc_dir, key), "wb") as f:
    pickle.dump(df_smote, f, protocol=4)

In [None]:
key = 'gtex_filtered_tmm_intersect_test.pkl'
with open(os.path.join(proc_dir, key), "wb") as f:
    pickle.dump(df_test_data, f, protocol=4)

In [None]:
key = 'gtex_filtered_tmm_intersect_imbalanced.pkl'
with open(os.path.join(proc_dir, key), "wb") as f:
    pickle.dump(df_train_data, f, protocol=4)
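
A quick way to confirm that a saved file is usable downstream is to load it back and compare it with the in-memory dataframe. The sketch below does this for a toy frame in a temporary directory; `df_toy` and the file name `toy_output.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for df_smote / df_test_data / df_train_data
df_toy = pd.DataFrame(
    {"gene_a": [0.1, 0.2], "type": ["Liver", "Lung"]},
    index=["sample_0", "sample_1"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_output.pkl")  # hypothetical file name
    # The context manager closes the file even if dump() raises
    with open(path, "wb") as f:
        pickle.dump(df_toy, f, protocol=4)
    with open(path, "rb") as f:
        df_back = pickle.load(f)

# True when values, index, columns, and dtypes all match
round_trip_ok = df_back.equals(df_toy)
```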