# Data preparation: GTEx V8

This file is part of the Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project.

Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with the Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project.  If not, see <http://www.gnu.org/licenses/>.

### Objective:
> Load files from the raw data folder, filter genes, and merge with labels before saving the result to the interim data folder

### Input files:
1. *GTEx_v8_TMM_values_protein_coding_filtered.tsv*
2. *filtered_genes.pkl*
3. *GTEx_v8_metadata_filtered.tsv*

### Output files:
1. *gtex_filtered_tmm_intersect.pkl*  
 
### Table of contents:
1. [Import Modules](#1.-Import-Modules)  
2. [Set static paths](#2.-Set-static-paths)  
3. [Load files](#3.-Load-files)  
    3.1 [Load RNAseq](#3.1-Load-RNAseq)  
    3.2 [Load gene list](#3.2-Load-gene-list)  
    3.3 [Load labels](#3.3-Load-labels)  
4. [Process data](#4.-Process-data)  
    4.1 [Reshape dataframe](#4.1-Reshape-dataframe)  
    4.2 [Filter genes](#4.2-Filter-genes)  
    4.3 [Add labels](#4.3-Add-labels)
5. [Save outputs](#5.-Save-outputs) 

## 1. Import Modules

In [None]:
import pandas as pd
import os
import pickle

In [None]:
# Specify max number of rows and columns to be displayed in dataframes
pd.options.display.max_rows = 1999
pd.options.display.max_columns = 1999

# Display full output in notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 2. Set static paths

In [None]:
local_dir = '../data/raw/'
gene_dir = "../data/gene_lists/"
raw_dir = "../data/raw/"
interim_dir = '../data/interim/'

## 3. Load files

#### 3.1 Load RNAseq

In [None]:
%%time

key = 'GTEx_v8_TMM_values_protein_coding_filtered.tsv'
gtex_tmm_filter_raw = pd.read_csv(os.path.join(local_dir, key), sep='\t')

#### 3.2 Load gene list

In [None]:
key = 'filtered_genes.pkl'
with open(os.path.join(gene_dir, key),"rb") as f:
    tmm_tpm_intersect = pickle.load(f)

#### 3.3 Load labels

In [None]:
%%time
key = 'GTEx_v8_metadata_filtered.tsv'
GTEx_v8_metadata_filtered = pd.read_csv(os.path.join(raw_dir, key), sep='\t')
GTEx_v8_metadata_filtered.shape
GTEx_v8_metadata_filtered.head()

## 4. Process data

#### 4.1 Reshape dataframe

In [None]:
gtex_tmm_filter_intersect = gtex_tmm_filter_raw.set_index('Associated.Gene.Name')
gtex_tmm_filter_intersect.index.name = None
gtex_tmm_filter_intersect = gtex_tmm_filter_intersect.drop(columns=['Gene.Name','Gene.Biotype','Chromosome.Name','Gene.Start.bp','Gene.End.bp','Strand'])
gtex_tmm_filter_intersect = gtex_tmm_filter_intersect.T
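
The reshape above can be illustrated on a toy table. This is a minimal sketch: the gene names, sample IDs, and values are made up, and only one of the annotation columns is reproduced.

```python
import pandas as pd

# Toy stand-in for the raw TMM table (hypothetical values):
# one row per gene, an annotation column, then one column per sample.
toy_raw = pd.DataFrame({
    "Associated.Gene.Name": ["GENE_A", "GENE_B"],
    "Gene.Biotype": ["protein_coding", "protein_coding"],
    "SAMPLE_1": [1.5, 0.0],
    "SAMPLE_2": [2.5, 3.0],
})

# Same sequence of steps as the cell above, applied to the toy table
toy = toy_raw.set_index("Associated.Gene.Name")  # gene names become the row index
toy.index.name = None                            # drop the index label
toy = toy.drop(columns=["Gene.Biotype"])         # keep only the sample columns
toy = toy.T                                      # rows = samples, columns = genes

print(toy)
```

After the transpose, each row is a sample and each column is a gene, which is the layout the later filtering and labelling steps rely on.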

#### 4.2 Filter genes

In [None]:
gtex_tmm_filter_intersect = gtex_tmm_filter_intersect.loc[:,tmm_tpm_intersect]
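
Selecting columns with `.loc[:, labels]` raises a `KeyError` if any requested label is absent. The sketch below shows one way to report missing genes before filtering. It assumes the gene list behaves like a list of column names; the frame, gene names, and the `wanted` list are hypothetical.

```python
import pandas as pd

# Toy expression matrix: rows = samples, columns = genes (hypothetical values)
expr = pd.DataFrame(
    {"GENE_A": [1.0, 2.0], "GENE_B": [3.0, 4.0], "GENE_C": [5.0, 6.0]},
    index=["SAMPLE_1", "SAMPLE_2"],
)

# Hypothetical gene list; GENE_X is not present in the matrix
wanted = ["GENE_C", "GENE_A", "GENE_X"]

# Split the list into genes that exist as columns and genes that do not
present = [g for g in wanted if g in expr.columns]
missing = [g for g in wanted if g not in expr.columns]

if missing:
    print(f"{len(missing)} gene(s) not found in the matrix: {missing}")

# Keep only the genes that are present, in the order given by the list
filtered = expr.loc[:, present]
```

Whether to skip missing genes or stop with an error depends on the analysis; the notebook's own cell stops with an error.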

#### 4.3 Add labels

In [None]:
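# The labels are attached by position in the next cell, so stop here
# if the metadata rows are not in the same order as the expression samples
assert GTEx_v8_metadata_filtered['Sample.ID'].tolist() == gtex_tmm_filter_intersect.index.tolist()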

In [None]:
gtex_tmm_filter_intersect['type'] = GTEx_v8_metadata_filtered['Sample.Type.Specific'].tolist()

In [None]:
# Check that labels have been attached to main df
gtex_tmm_filter_intersect['type'].value_counts().sort_index()
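
The label column above is attached by position, which is only correct when the metadata rows and the expression rows share the same order. As an alternative, the sketch below attaches labels by sample ID. It assumes sample IDs are unique in the metadata; the toy frames and tissue names are made up.

```python
import pandas as pd

# Toy expression matrix indexed by sample ID (hypothetical values)
expr = pd.DataFrame({"GENE_A": [1.0, 2.0, 3.0]}, index=["S1", "S2", "S3"])

# Toy metadata, deliberately in a different row order
meta = pd.DataFrame({
    "Sample.ID": ["S3", "S1", "S2"],
    "Sample.Type.Specific": ["Liver", "Lung", "Heart"],
})

# Build a sample ID -> label lookup
labels = meta.set_index("Sample.ID")["Sample.Type.Specific"]

# Fail early if any expression sample has no label
unlabelled = expr.index.difference(labels.index)
assert unlabelled.empty, f"Samples without a label: {list(unlabelled)}"

# Reorder the labels to match the expression rows, then attach them
expr["type"] = labels.reindex(expr.index)

print(expr)
```

With this approach the result does not depend on the row order of the metadata file.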

## 5. Save outputs

In [None]:
key = 'gtex_filtered_tmm_intersect.pkl'

# Save to the interim data folder; the context manager closes the file handle
with open(os.path.join(interim_dir, key), "wb") as f:
    pickle.dump(gtex_tmm_filter_intersect, f, protocol=4)
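
A quick way to confirm that a pickled dataframe can be read back unchanged is a round trip through a temporary file. This is a minimal sketch on a toy frame; the file name and values are hypothetical, and a temporary directory is used so nothing is written to the project folders.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for the labelled expression matrix (hypothetical values)
toy = pd.DataFrame(
    {"GENE_A": [1.0, 2.0], "type": ["Lung", "Liver"]},
    index=["SAMPLE_1", "SAMPLE_2"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_intersect.pkl")

    # Write with the same pickle protocol as the notebook
    with open(path, "wb") as f:
        pickle.dump(toy, f, protocol=4)

    # Read the file back
    with open(path, "rb") as f:
        restored = pickle.load(f)

# True when values, index, columns, and dtypes all match
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```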