# Data from GTEx
Link: https://www.gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression


**Bulk-Tissue-Expression:**

Contains gene expression in *healthy* tissue samples from adults. For each gene, the file lists the expression value (TPM) of every tissue sample.

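**Output file format:**
* id: Ensembl gene ID without the version suffix
* sum: sum of the TPM values per gene (intermediate file only)
* count: number of TPM values per gene (intermediate file only)
* tpm: mean TPM per gene (sum / count)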

## Load Data and melt down to long format
→ takes about 20 minutes with 3000 rows per chunk

The input file has one row per gene and one column per tissue sample.
We load the data in chunks and melt each chunk into a long format with columns for the gene ID, tissue sample and TPM value.
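
As a minimal sketch of the wide-to-long step, the snippet below builds a tiny stand-in for the input file in memory and melts it chunk by chunk. The gene IDs, sample names and values are made up for illustration; only the layout (two metadata lines before the header, `Name` and `Description` columns, one column per sample) mirrors what the loading cell assumes.

```python
import io
import pandas as pd

# Toy stand-in for the GCT file: two metadata lines, then a tab-separated table.
# All IDs, sample names and values are invented for illustration.
toy_gct = (
    "#1.2\n"
    "3\t2\n"
    "Name\tDescription\tsample_A\tsample_B\n"
    "ENSG01.1\tGENE1\t1.0\t2.0\n"
    "ENSG02.3\tGENE2\t0.0\t5.5\n"
    "ENSG03.2\tGENE3\t7.0\t0.5\n"
)

long_chunks = []
with pd.read_csv(io.StringIO(toy_gct), sep="\t", skiprows=2, chunksize=2) as reader:
    for chunk in reader:
        chunk = chunk.drop(columns=["Description"]).rename(columns={"Name": "id"})
        # Strip the version suffix from the gene ID (ENSG01.1 -> ENSG01)
        chunk["id"] = chunk["id"].str.split(".").str[0]
        # One output row per (gene, sample) pair
        long_chunks.append(
            chunk.melt(id_vars=["id"], var_name="tissue", value_name="tpm")
        )

toy_long = pd.concat(long_chunks, ignore_index=True)
print(toy_long)
```

Each wide row with two sample columns becomes two long rows, so the three toy genes yield six rows in total.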


In [1]:
import pandas as pd
import gc
import time

In [4]:
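# Update the running dataset statistics with the values of one chunk
def dataset_analysis(df):
    global missing_values, min_tpm, max_tpm
    
    # Accumulate the per-column count of missing values across chunks
    missing_values += df.isnull().sum()
    
    # Track the smallest and largest TPM value seen so far
    min_tpm = min(min_tpm, df['tpm'].min())
    max_tpm = max(max_tpm, df['tpm'].max())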
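
The same running statistics can also be kept without module-level globals. The following is only a sketch of that alternative, not the approach used in this notebook: `RunningStats` is a hypothetical helper, and unlike the function above it counts missing cells as a single total instead of per column.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class RunningStats:
    """Hypothetical helper: running statistics over long-format chunks."""
    missing: int = 0
    min_tpm: float = float("inf")
    max_tpm: float = float("-inf")

    def update(self, df: pd.DataFrame) -> None:
        # Total number of missing cells in this chunk (all columns)
        self.missing += int(df.isnull().sum().sum())
        # pandas skips NaN when computing min/max
        self.min_tpm = float(min(self.min_tpm, df["tpm"].min()))
        self.max_tpm = float(max(self.max_tpm, df["tpm"].max()))


# Two made-up chunks; the second one contains a missing TPM value
stats = RunningStats()
stats.update(pd.DataFrame({"id": ["a", "b"], "tissue": ["s1", "s1"], "tpm": [0.5, 3.0]}))
stats.update(pd.DataFrame({"id": ["c", "d"], "tissue": ["s1", "s1"], "tpm": [None, 9.0]}))
print(stats)
```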

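NOTE: The time per chunk grows linearly with the chunk size, so the chunk size has little effect on the total runtime:

* chunk size 2000 = 40 sec per chunk → 20 sec per 1000 rows
* chunk size 3000 = 60 sec per chunk → 20 sec per 1000 rows
* chunk size 4000 = 85 sec per chunk → 21 sec per 1000 rows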
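
The measurements above can be turned into a rough estimate of the total runtime. This is a back-of-the-envelope sketch that assumes a constant per-row rate and uses the row total of 56200 from the progress message in the loading cell; `estimate_runtime_minutes` is a hypothetical helper.

```python
import math


def estimate_runtime_minutes(total_rows: int, seconds_per_1000_rows: float) -> float:
    """Estimated wall-clock time in minutes at a constant per-row rate."""
    return total_rows / 1000 * seconds_per_1000_rows / 60


total_rows = 56200  # row total taken from the progress message

# (chunk size, measured seconds per 1000 rows)
for chunksize, sec_per_1000 in [(2000, 20), (3000, 20), (4000, 21)]:
    n_chunks = math.ceil(total_rows / chunksize)
    minutes = estimate_runtime_minutes(total_rows, sec_per_1000)
    print(f"chunksize={chunksize}: {n_chunks} chunks, ~{minutes:.1f} min")
```

All three chunk sizes come out at roughly 19 to 20 minutes, which matches the observed runtime.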

In [5]:
file = "../import_data/GTEx/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct"
output_file = "../processed_data/GTEX_healthy.csv"


# Initialize an output file and overwrite if already exists
with open(output_file, 'w') as f_out:
    f_out.write("id,tissue,tpm\n")

missing_values = 0
min_tpm = float('+inf')
max_tpm = float('-inf')

chunksize = 2000
save_threshold = 10000

output_chunks = []
outputrows = 0

with pd.read_csv(file, delimiter="\t", skiprows=2, chunksize=chunksize) as reader:
    counter = 0

    for df_chunk in reader:
        start_time = time.time()
        
        # Cleanup: drop the description column and strip the version suffix from the gene ID
        df_chunk = df_chunk.drop(columns=['Description'])
        df_chunk = df_chunk.rename(columns={'Name':'id'})
        df_chunk['id'] = df_chunk['id'].str.split('.').str[0]
        
        df_long = pd.melt(df_chunk, id_vars=['id'], var_name='tissue', value_name='tpm', ignore_index=True)


        df_long["tpm"] = df_long["tpm"].astype(float)
        
        output_chunks.append(df_long)
        outputrows += df_long.shape[0]
        
        dataset_analysis(df_long)
        
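        # Flush the buffered chunks to disk once enough rows have accumulated
        if sum(len(chunk) for chunk in output_chunks) >= save_threshold:
            pd.concat(output_chunks).to_csv(output_file, mode='a', header=False, index=False)
            output_chunks.clear()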
        
        # Clear memory
        del df_long
        
        end_time = time.time()
        duration = end_time - start_time        
        print(f"{counter} / 56200 processed in {duration:.2f} seconds")
        
        counter += chunksize
        
if output_chunks:
    pd.concat(output_chunks).to_csv(output_file, mode='a', header=False, index=False)
        
print(f"Output file contains {outputrows} rows")

0 / 56200 processed in 39.85 seconds
2000 / 56200 processed in 39.32 seconds
4000 / 56200 processed in 39.09 seconds
6000 / 56200 processed in 39.40 seconds
8000 / 56200 processed in 41.90 seconds
10000 / 56200 processed in 40.64 seconds
12000 / 56200 processed in 37.18 seconds
14000 / 56200 processed in 37.06 seconds
16000 / 56200 processed in 38.19 seconds
18000 / 56200 processed in 37.72 seconds
20000 / 56200 processed in 40.70 seconds
22000 / 56200 processed in 40.68 seconds
24000 / 56200 processed in 41.08 seconds
26000 / 56200 processed in 39.42 seconds
28000 / 56200 processed in 40.07 seconds
30000 / 56200 processed in 39.95 seconds
32000 / 56200 processed in 39.65 seconds
34000 / 56200 processed in 39.18 seconds
36000 / 56200 processed in 39.46 seconds
38000 / 56200 processed in 37.65 seconds
40000 / 56200 processed in 38.05 seconds
42000 / 56200 processed in 38.21 seconds
44000 / 56200 processed in 38.19 seconds
46000 / 56200 processed in 39.78 seconds
48000 / 56200 processed 

### Analyze the dataset

In [6]:
print(f"Missing values:\n"
      f"{missing_values}\n")
print(f"Min TPM: {min_tpm}")
print(f"Max TPM: {max_tpm}")

Missing values:
id        0
tissue    0
tpm       0
dtype: int64

Min TPM: 0.0
Max TPM: 747400.0


## Save grouped data

In [8]:
file = "../processed_data/GTEX_healthy.csv"
output_file = "../processed_data/GTEX_healthy_temp.csv"

chunksize = 200000000

# Initialize output file and overwrite if already exists
with open(output_file, 'w') as f_out:
    f_out.write("id,tmp sum,tmp count\n")
    
with pd.read_csv(file, chunksize=chunksize) as reader:
    for df_chunk in reader:
        start_time = time.time()
        
        # Partial sum and count of the TPM values per gene within this chunk
        df_mean = df_chunk.drop(columns=["tissue"])
        df_mean = df_mean.groupby('id').agg(['sum','count'])
        df_mean.to_csv(output_file, mode='a', header=False, index=True)
        
        # Measure the duration before reporting it
        end_time = time.time()
        duration = end_time - start_time
        print(f"{df_mean.shape[0]} rows processed in {duration:.2f} seconds")
        
        del df_mean
        gc.collect()

12000 rows processed in 4.24 seconds
14000 rows processed in 21.50 seconds
14000 rows processed in 18.89 seconds
14000 rows processed in 18.15 seconds
10156 rows processed in 19.39 seconds


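## Load Temp file to calculate mean TPM

The rows of one gene can be spread over several chunks, so the temp file may contain several partial sums and counts per gene. These are added up per gene before the mean TPM is computed as sum / count.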

In [9]:
df_mean = pd.read_csv("../processed_data/GTEX_healthy_temp.csv")
df_mean = df_mean.groupby('id').sum()
df_mean['tpm'] = df_mean['tmp sum']/df_mean['tmp count']
df_mean

| id | tmp sum | tmp count | tpm |
|---|---|---|---|
| ENSG00000000003 | 274030.408120 | 17382 | 15.765183 |
| ENSG00000000005 | 62036.191561 | 17382 | 3.568990 |
| ENSG00000000419 | 841623.549500 | 17382 | 48.419258 |
| ENSG00000000457 | 101256.444200 | 17382 | 5.825362 |
| ENSG00000000460 | 41291.764070 | 17382 | 2.375547 |
| ... | ... | ... | ... |
| ENSG00000284592 | 27.276690 | 17382 | 0.001569 |
| ENSG00000284594 | 88.438900 | 17382 | 0.005088 |
| ENSG00000284595 | 7720.422600 | 17382 | 0.444162 |
| ENSG00000284596 | 158.655000 | 17382 | 0.009128 |
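
The two-step aggregation works because sums and counts from different chunks can simply be added, whereas per-chunk means cannot be averaged directly. A small sketch with made-up data, where gene `g1` is split across two hypothetical chunks:

```python
import pandas as pd

# Made-up long-format data; gene "g1" appears in both chunks
chunk_1 = pd.DataFrame({"id": ["g1", "g1", "g2"], "tpm": [1.0, 3.0, 10.0]})
chunk_2 = pd.DataFrame({"id": ["g1", "g3"], "tpm": [5.0, 4.0]})

# Step 1: partial sum and count per gene within each chunk
partials = pd.concat(
    [c.groupby("id")["tpm"].agg(["sum", "count"]) for c in (chunk_1, chunk_2)]
)

# Step 2: add up the partial results per gene, then divide
combined = partials.groupby(level="id").sum()
combined["mean"] = combined["sum"] / combined["count"]

# Reference: mean computed on the full data in one pass
direct = pd.concat([chunk_1, chunk_2]).groupby("id")["tpm"].mean()

print(combined)
print(direct)
```

For `g1` the chunk-wise means are 2.0 and 5.0, and averaging them would give 3.5, while the combined sum / count gives the correct value of 3.0.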


## Save final file

In [10]:
df_mean = df_mean.drop(columns=['tmp sum', 'tmp count'])
df_mean.to_csv("../processed_data/GTEX_healthy_mean.csv")
print(f'There are {df_mean.shape[0]} genes in the final file')

There are 56156 genes in the final file
