# Data from GTEx
Link: https://www.gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression


**Bulk-Tissue-Expression:**

Contains gene expression in *healthy* tissue samples from adults. For each gene, the expression value (TPM, transcripts per million) for each tissue sample is listed.

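**Output file format:**
* id: Ensembl gene ID without the version suffix
* tpm: mean TPM over all samples (sum / count)

The intermediate columns sum and count exist only in the temporary file and are dropped before the final file is saved.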

## Load Data and melt down to long format
→ takes 20 minutes with 3000 rows per chunk

The input file has one column per tissue sample and one row per gene.
We load the data in chunks and melt each chunk into long format with columns for the gene name (`Name`), the sample (`Tissue`), and the TPM value (`TPM`).
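As a minimal sketch of this reshaping step, the toy input below imitates the GCT layout (a version line, a dimensions line, then `Name`, `Description`, and one column per sample). The gene and sample identifiers are made up for illustration and do not come from the GTEx file.

```python
import io

import pandas as pd

# Toy GCT-style content: version line, dimensions line, then the data table.
# All gene and sample IDs here are invented for illustration.
toy_gct = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tSAMPLE-A\tSAMPLE-B\tSAMPLE-C\n"
    "ENSG0000000001.1\tGENE1\t1.0\t2.0\t3.0\n"
    "ENSG0000000002.4\tGENE2\t0.0\t0.5\t4.5\n"
)

# skiprows=2 skips the two header lines that precede the column names
wide = pd.read_csv(io.StringIO(toy_gct), delimiter="\t", skiprows=2)

# Wide to long: one row per (gene, sample) pair
long = pd.melt(
    wide,
    id_vars=["Name", "Description"],
    var_name="Tissue",
    value_name="TPM",
).drop(columns=["Description"])

print(long)
```

Two genes and three sample columns give six rows in long format, which is why the real output file grows to genes × samples rows.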


In [3]:
import pandas as pd
import gc
import time

In [8]:
file = "../import_data/GTEx/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct"
output_file = "../processed_data/GTEX_healthy.csv"

chunksize = 3000
counter = 0
outputrows = 0

# Initialize output file and overwrite if already exists
with open(output_file, 'w') as f_out:
    f_out.write("Name,Tissue,TPM\n")

In [9]:
# skiprows=2 skips the two GCT header lines (version and dimensions)
with pd.read_csv(file, delimiter="\t", skiprows=2, chunksize=chunksize) as reader:
    for df_chunk in reader:
        start_time = time.time()
        # Wide to long: one row per (gene, sample column) pair
        df_melt = pd.melt(df_chunk, id_vars=['Name', 'Description'], var_name='Tissue', value_name='TPM', ignore_index=True)
        df_melt = df_melt.drop(columns=['Description'])

        df_melt["TPM"] = df_melt["TPM"].astype(float)
        outputrows += df_melt.shape[0]

        # Append to the output file; the header was written during initialization
        df_melt.to_csv(output_file, mode='a', header=False, index=False)

        # Clear memory
        del df_melt
        gc.collect()

        end_time = time.time()
        duration = end_time - start_time
        # counter is the number of genes processed before this chunk
        print(f"{counter} / 56200 processed in {duration:.2f} seconds")

        counter += chunksize

print(f"Output file contains {outputrows} rows")

0 / 56200 processed in 57.51 seconds
3000 / 56200 processed in 56.86 seconds
6000 / 56200 processed in 56.90 seconds
9000 / 56200 processed in 57.67 seconds
12000 / 56200 processed in 55.18 seconds
15000 / 56200 processed in 56.85 seconds
18000 / 56200 processed in 56.51 seconds
21000 / 56200 processed in 56.90 seconds
24000 / 56200 processed in 56.48 seconds
27000 / 56200 processed in 58.43 seconds
30000 / 56200 processed in 56.59 seconds
33000 / 56200 processed in 57.32 seconds
36000 / 56200 processed in 56.87 seconds
39000 / 56200 processed in 59.52 seconds
42000 / 56200 processed in 58.75 seconds
45000 / 56200 processed in 60.45 seconds
48000 / 56200 processed in 57.53 seconds
51000 / 56200 processed in 57.14 seconds
54000 / 56200 processed in 41.80 seconds
Output file contains 976868400 rows
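
The reported row count can be cross-checked with a little arithmetic. This sketch assumes 56200 genes (the total hard-coded in the progress message) and 17382 samples (the per-gene `count` that appears in the aggregation further down).

```python
import math

n_genes = 56200      # assumed total, taken from the progress message
n_samples = 17382    # assumed, taken from the per-gene count column
chunksize = 3000

# Long format has one row per (gene, sample) pair
expected_rows = n_genes * n_samples
# Number of chunks the reader yields, including the final partial chunk
n_chunks = math.ceil(n_genes / chunksize)

print(expected_rows)  # 976868400
print(n_chunks)       # 19
```

Both values agree with the log above: 19 progress lines and 976868400 output rows.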


## Save grouped data
→ takes about 6 minutes with 2 * 10 ** 8 rows per chunk

In [26]:
file = "../processed_data/GTEX_healthy.csv"
output_file = "../processed_data/GTEX_healthy_temp.csv"

chunksize = 200000000

# Initialize output file and overwrite if already exists
with open(output_file, 'w') as f_out:
    f_out.write("Name,sum,count\n")
    
with pd.read_csv(file, chunksize=chunksize) as reader:
    for df_chunk in reader:
        start_time = time.time()
        # Per-gene partial sum and count within this chunk; a gene can span several chunks
        df_mean = df_chunk.drop(columns=["Tissue"])
        df_mean = df_mean.groupby('Name').agg(['sum','count'])
        df_mean.to_csv(output_file, mode='a', header=False, index=True)

        # Compute the duration before printing it, so each line reports its own chunk
        end_time = time.time()
        duration = end_time - start_time
        print(f"{df_mean.shape[0]} rows processed in {duration:.2f} seconds")

        del df_mean
        gc.collect()

12000 rows processed in 5.04 seconds
15000 rows processed in 15.53 seconds
15000 rows processed in 14.15 seconds
15000 rows processed in 14.34 seconds
11200 rows processed in 13.95 seconds
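
Writing `sum` and `count` rather than a per-chunk mean keeps the later averaging exact when a gene's rows are split across chunks. The sketch below uses made-up values and hypothetical gene names (`g1`, `g2`, `g3`) to show that combining partial sums and counts reproduces the mean over all rows.

```python
import pandas as pd

# Two toy chunks; gene g1 has rows in both (values are invented)
chunk1 = pd.DataFrame({"Name": ["g1", "g1", "g2"], "TPM": [1.0, 2.0, 10.0]})
chunk2 = pd.DataFrame({"Name": ["g1", "g3"], "TPM": [6.0, 4.0]})

# Per-chunk partial aggregates, as written to the temporary file
partials = [c.groupby("Name")["TPM"].agg(["sum", "count"]) for c in (chunk1, chunk2)]

# Combine the partials: add up sums and counts per gene, then divide
combined = pd.concat(partials).groupby(level=0).sum()
combined["mean"] = combined["sum"] / combined["count"]

# Reference: the mean computed over all rows at once
exact = pd.concat([chunk1, chunk2]).groupby("Name")["TPM"].mean()

print(combined)
```

For `g1` the combined mean is 9 / 3 = 3.0, whereas averaging the two per-chunk means (1.5 and 6.0) would give 3.75.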


## Load temp file to calculate mean TPM

In [4]:
df_mean = pd.read_csv("../processed_data/GTEX_healthy_temp.csv")
df_mean

,Name,sum,count
0,ENSG00000000457.13,1.012564e+05,17382
1,ENSG00000000460.16,4.129176e+04,17382
2,ENSG00000000938.12,7.962391e+05,17382
3,ENSG00000000971.15,1.488672e+06,17382
4,ENSG00000001460.17,5.044793e+04,17382
...,...,...,...
68195,ENSG00000284550.1,5.448620e+01,17382
68196,ENSG00000284553.1,5.785420e+01,11446
68197,ENSG00000284564.1,8.892744e+03,11445
68198,ENSG00000284574.1,3.219920e+03,11445


In [41]:
df_mean.rename(columns={'Name':'id'}, inplace=True)
df_mean['id'] = df_mean['id'].str.split('.').str[0]
df_mean = df_mean.groupby('id').sum()
df_mean

id,sum,count
ENSG00000000003,274030.408120,17382
ENSG00000000005,62036.191561,17382
ENSG00000000419,841623.549500,17382
ENSG00000000457,101256.444200,17382
ENSG00000000460,41291.764070,17382
...,...,...
ENSG00000284592,27.276690,17382
ENSG00000284594,88.438900,17382
ENSG00000284595,7720.422600,17382
ENSG00000284596,158.655000,17382
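
The effect of stripping the version suffix and grouping by `id` can be sketched on a small frame. The gene IDs are borrowed from the output above, but the `sum` and `count` values are invented for illustration.

```python
import pandas as pd

# Partial aggregates as they might appear in the temporary file;
# the first gene was split across two chunks (numbers are made up)
partial = pd.DataFrame({
    "Name": ["ENSG00000284553.1", "ENSG00000284553.1", "ENSG00000000457.13"],
    "sum": [10.0, 5.0, 8.0],
    "count": [2, 1, 4],
})

# Drop the version suffix after the dot, then add up the partial rows per gene
partial["id"] = partial["Name"].str.split(".").str[0]
merged = partial.drop(columns=["Name"]).groupby("id").sum()

print(merged)
```

After grouping, each gene has a single row whose `sum` and `count` cover all of its partial rows, so `sum / count` gives the mean over every sample.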


In [42]:
# Mean TPM per gene over all samples
df_mean['tpm'] = df_mean['sum']/df_mean['count']
df_mean = df_mean.drop(columns=['sum', 'count'])
df_mean

id,tpm
ENSG00000000003,15.765183
ENSG00000000005,3.568990
ENSG00000000419,48.419258
ENSG00000000457,5.825362
ENSG00000000460,2.375547
...,...
ENSG00000284592,0.001569
ENSG00000284594,0.005088
ENSG00000284595,0.444162
ENSG00000284596,0.009128


## Save final file

In [44]:
df_mean.to_csv("../processed_data/GTEX_healthy_mean.csv")
print(f'There are {df_mean.shape[0]} genes in the final file')