# FLAIR Transcript Quantification Results Analysis, Part 1

In this notebook, I import GTEx V9 RNA-seq transcript quantification data from the open-access GTEx database. The transcripts were quantified with FLAIR. The notebook covers data preprocessing and produces two dataframes: one holding the quantification data for novel transcripts and one holding the quantification data for annotated transcripts.

## Part 1: Import Data and Configure Python Libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec
%matplotlib inline
import seaborn as sns
import re
from IPython.display import display
from matplotlib.pyplot import gcf
from sklearn.decomposition import PCA 
from sklearn.preprocessing import StandardScaler
from PIL import ImageColor
from matplotlib.patches import Patch #for custom legend making
import scipy.spatial as sp, scipy.cluster.hierarchy as hc #for faster computing of hierarchical clusters

In [41]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
#pd.options.display.max_columns = None #display all columns in dataframe

In [3]:
#pd.options.display.max_colwidth = 100 #show the full content of long strings

### Import Data

In [4]:
os.getcwd()

'C:\\Users\\15082\\OneDrive\\Desktop\\thesis_research\\gtex_v9_data_analysis\\FLAIR'

In [5]:
data_dir = 'gtex_v9_data\\data_for_analysis\\gtex_database_data'

In [6]:
flair_quant_results_file_path = os.path.join(data_dir, 'glinos_flair_quant_tpm_results.csv')

#### Normalized FLAIR Transcript Quantification Data (in TPM)

In [7]:
# change working directory
os.chdir('C:\\Users\\15082\\OneDrive\\Desktop\\thesis_research')

In [8]:
flair_quant_data = pd.read_csv(flair_quant_results_file_path)
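
The cells above hard-code Windows path separators and depend on `os.chdir`. A minimal sketch of a more portable alternative with `pathlib` follows. The temporary directory, the toy file name, and its contents are assumptions made only so the example runs on its own; in this notebook the project root would be the `thesis_research` folder.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical project root for illustration only; a temporary directory
# stands in for the real thesis_research folder.
project_root = Path(tempfile.mkdtemp())

# Joining parts with "/" lets pathlib pick the correct separator per OS,
# so no working-directory change is needed.
data_dir = project_root / "gtex_v9_data" / "data_for_analysis" / "gtex_database_data"
data_dir.mkdir(parents=True)

# Write a tiny stand-in CSV (made-up values) so the read below succeeds.
quant_path = data_dir / "toy_flair_quant_tpm_results.csv"
quant_path.write_text("transcript,sample_a,sample_b\nENST_toy_1,1.5,0.0\n")

toy_quant = pd.read_csv(quant_path)
print(toy_quant.shape)
```

Building the full path from one root variable means only that variable has to change when the notebook moves to another machine.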

In [9]:
flair_quant_data.head(5)

Unnamed: 0,transcript,GTEX-1192X-0011-R10a-SM-4RXXZ,GTEX-11H98-0011-R11b-SM-4SFLZ,GTEX-11TTK-0011-R7b-SM-4TVFS,GTEX-1211K-0826-SM-7LDFQ,GTEX-1313W-0011-R7b-SM-4ZL3U,GTEX-13QBU-0426-SM-5A4VT,GTEX-13QJ3-0726-SM-7LDHS,GTEX-13QJ3-0726-SM-7LDHS_rep,GTEX-13RTJ-0011-R7b-SM-5CTCB,...,GTEX-QV44-0008-SM-3QNG7_ctrl2,GTEX-QV44-0008-SM-3QNG7_exp,GTEX-S4Z8-0008-SM-2Y983_ctrl,GTEX-S4Z8-0008-SM-2Y983_exp1,GTEX-S4Z8-0008-SM-2Y983_exp2,GTEX-S95S-0008-SM-3RQ8B_ctrl,GTEX-S95S-0008-SM-3RQ8B_exp1,GTEX-S95S-0008-SM-3RQ8B_exp2,GTEX-WY7C-0008-SM-3NZB5_ctrl,GTEX-WY7C-0008-SM-3NZB5_exp
0,000187c4-a488-40f0-a69c-0a89582f3241_ENSG00000...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.334328,0.358385,0.420881,0.97198,0.0,0.0,0.0,0.262969,0.0,0.397173
1,00026598-3078-4e2f-8ac9-dd8f523396b9_ENSG00000...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.358385,0.0,0.0,0.278141,0.294667,0.0,0.262969,0.0,0.397173
2,0002a5e2-f01a-4690-a7db-7af726712a5e_ENSG00000...,0.380702,5.359084,1.961873,4.590649,1.793796,7.308494,6.346269,2.378808,0.653233,...,8.692515,12.185087,7.154979,12.635743,10.013078,14.144038,13.873059,13.937337,1.727199,11.518015
3,000339f1-1769-4608-b369-59aa222cd7b7_ENSG00000...,0.0,0.297727,0.0,1.020144,0.0,0.0,0.0,0.0,1.959698,...,0.668655,0.358385,0.420881,0.97198,1.112564,0.294667,0.867066,0.525937,0.4318,1.191519
4,0003706a-94a7-4419-a61d-6310d7a9c10c_ENSG00000...,14.847375,18.459067,17.656857,4.080577,10.762776,13.288171,15.577205,11.418279,11.104956,...,5.683568,7.167698,6.313216,10.691783,6.397244,7.366687,6.936529,9.203902,9.067796,7.149113


In [10]:
flair_quant_data.shape

(93630, 93)

#### Complete Sample Information (Supplementary Table 1)

Sequencing metadata for each sample.

In [11]:
sample_info_path = os.path.join(data_dir, 'sample_info_complete.csv')

In [12]:
sample_info = pd.read_csv(sample_info_path)

In [13]:
sample_info.head(5)

Unnamed: 0,sample_id,date_of_sequencing,sample_name,tissue,protocol,mrna_rin,flush_buffer,amount_loaded_ng,run_time,total_reads,median_read_length,median_read_quality,aligned_reads,median_read_length_align,median_read_quality_aligned,WGS,data_center,RNA_extraction_method,3_prime_bias_median,3_prime_bias_sd
0,LV1681,53119,CVD-LV1681,Heart - Left Ventricle,cDNA-PCR,,PBT,60.0,48.0,2287307,195,9.9,620717,696,10.9,No,BROAD,RNA Extraction from Paxgene-derived Lysate Pla...,0.653,0.378
1,LV1702,53119,CVD-LV1702,Heart - Left Ventricle,cDNA-PCR,,PBT,60.0,48.0,4456040,211,10.3,1517665,737,11.5,No,BROAD,RNA Extraction from Paxgene-derived Lysate Pla...,0.754,0.357
2,LV1708,60319,CVD-LV1708,Heart - Left Ventricle,cDNA-PCR,,PBT,60.0,48.0,2586875,261,10.5,1117070,699,11.2,No,BROAD,RNA Extraction from Paxgene-derived Lysate Pla...,0.659,0.382
3,LV1723,60319,CVD-LV1723,Heart - Left Ventricle,cDNA-PCR,,PBT,60.0,48.0,3577244,230,10.5,1017015,666,11.5,No,BROAD,RNA Extraction from Paxgene-derived Lysate Pla...,0.57,0.399
4,GTEX-1192X-0011-R10a-SM-4RXXZ,52219,GTEX-1192X,Brain - Frontal Cortex (BA9),cDNA-PCR,8.7,PBT,60.0,48.0,7568902,651,11.4,5593813,750,11.8,Yes,BROAD,RNA isolation_PAXgene Tissue miRNA,0.782,0.348


In [14]:
sample_info.shape

(96, 20)
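
The quantification table has 92 sample columns (93 columns minus `transcript`), while the metadata table has 96 rows, so at least four metadata rows have no matching quantification column. The sketch below shows one way to list the mismatches in both directions. It uses toy stand-in tables with made-up IDs, and it assumes that the sample column names and `sample_id` values are meant to match exactly, which I have not verified for the real data.

```python
import pandas as pd

# Toy stand-ins (made-up IDs and values), not the real GTEx tables.
toy_quant = pd.DataFrame({
    "transcript": ["tx1", "tx2"],
    "S1": [0.0, 1.2],
    "S2": [3.4, 0.0],
})
toy_info = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "tissue": ["Liver", "Lung", "Liver"],
})

# Sample IDs on each side: every quant column except the transcript column,
# and every value in the metadata sample_id column.
quant_samples = set(toy_quant.columns) - {"transcript"}
info_samples = set(toy_info["sample_id"])

# Set differences give the IDs present on one side only.
missing_from_quant = sorted(info_samples - quant_samples)
missing_from_info = sorted(quant_samples - info_samples)
print(missing_from_quant, missing_from_info)
```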

### Overview of Datasets

In [15]:
flair_quant_data.describe(include='object')

Unnamed: 0,transcript
count,93630
unique,93630
top,000187c4-a488-40f0-a69c-0a89582f3241_ENSG00000...
freq,1


In [16]:
sample_info.describe(include='object')

Unnamed: 0,sample_id,sample_name,tissue,protocol,flush_buffer,median_read_length,median_read_length_align,WGS,data_center,RNA_extraction_method
count,96,96,96,96,96,96,96,96,96,96
unique,96,61,15,2,2,87,83,2,2,3
top,LV1681,GTEX-WY7C,Cells - Cultured fibroblasts,cDNA-PCR,PBT,797,764,Yes,BROAD,RNA Extraction from Paxgene-derived Lysate Pla...
freq,1,6,22,94,63,2,3,83,83,35


In [17]:
sample_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   sample_id                    96 non-null     object 
 1   date_of_sequencing           96 non-null     int64  
 2   sample_name                  96 non-null     object 
 3   tissue                       96 non-null     object 
 4   protocol                     96 non-null     object 
 5   mrna_rin                     92 non-null     float64
 6   flush_buffer                 96 non-null     object 
 7   amount_loaded_ng             96 non-null     float64
 8   run_time                     92 non-null     float64
 9    total_reads                 96 non-null     int64  
 10  median_read_length           96 non-null     object 
 11  median_read_quality          96 non-null     float64
 12  aligned_reads                96 non-null     int64  
 13  median_read_length_ali

In [18]:
len(pd.unique(sample_info['sample_id']))

96

In [19]:
print(sample_info.groupby('tissue').size())

tissue
Adipose - Subcutaneous                       1
Brain - Anterior cingulate cortex (BA24)     1
Brain - Caudate (basal ganglia)              1
Brain - Cerebellar Hemisphere                8
Brain - Frontal Cortex (BA9)                 6
Brain - Putamen (basal ganglia)              6
Breast - Mammary Tissue                      1
Cells - Cultured fibroblasts                22
Heart - Atrial Appendage                     9
Heart - Left Ventricle                      11
K562                                         4
Liver                                        8
Lung                                         8
Muscle - Skeletal                            9
Pancreas                                     1
dtype: int64


## Part 2: Data Cleaning and Manipulation

### Make Separate Dataframes for Novel and Annotated Transcripts

### (1) Dataframe of Novel Transcripts

In [20]:
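# Take the first character of each transcript ID: annotated IDs start with 'E' (ENST...),
# while the novel IDs shown above start with a UUID-style prefix
flair_quant_data['transcript_type'] = flair_quant_data['transcript'].apply(lambda x: x[0])
# Label each transcript as annotated ('annot') or novel based on that character
flair_quant_data['novel_or_annot'] = flair_quant_data['transcript_type'].apply(lambda x: 'annot' if x=='E' else 'novel')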

In [21]:
flair_quant_data.head(3)

Unnamed: 0,transcript,GTEX-1192X-0011-R10a-SM-4RXXZ,GTEX-11H98-0011-R11b-SM-4SFLZ,GTEX-11TTK-0011-R7b-SM-4TVFS,GTEX-1211K-0826-SM-7LDFQ,GTEX-1313W-0011-R7b-SM-4ZL3U,GTEX-13QBU-0426-SM-5A4VT,GTEX-13QJ3-0726-SM-7LDHS,GTEX-13QJ3-0726-SM-7LDHS_rep,GTEX-13RTJ-0011-R7b-SM-5CTCB,...,GTEX-S4Z8-0008-SM-2Y983_ctrl,GTEX-S4Z8-0008-SM-2Y983_exp1,GTEX-S4Z8-0008-SM-2Y983_exp2,GTEX-S95S-0008-SM-3RQ8B_ctrl,GTEX-S95S-0008-SM-3RQ8B_exp1,GTEX-S95S-0008-SM-3RQ8B_exp2,GTEX-WY7C-0008-SM-3NZB5_ctrl,GTEX-WY7C-0008-SM-3NZB5_exp,transcript_type,novel_or_annot
0,000187c4-a488-40f0-a69c-0a89582f3241_ENSG00000...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.420881,0.97198,0.0,0.0,0.0,0.262969,0.0,0.397173,0,novel
1,00026598-3078-4e2f-8ac9-dd8f523396b9_ENSG00000...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.278141,0.294667,0.0,0.262969,0.0,0.397173,0,novel
2,0002a5e2-f01a-4690-a7db-7af726712a5e_ENSG00000...,0.380702,5.359084,1.961873,4.590649,1.793796,7.308494,6.346269,2.378808,0.653233,...,7.154979,12.635743,10.013078,14.144038,13.873059,13.937337,1.727199,11.518015,0,novel


In [22]:
flair_quant_data['transcript'].groupby(flair_quant_data['novel_or_annot']).count()

novel_or_annot
annot    21357
novel    72273
Name: transcript, dtype: int64
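
The split above keys on a single leading character. A slightly stricter variant tests for the full `ENST` prefix with a vectorized string method, as sketched below. This is an alternative, not the code that produced the counts above, and the two UUID-style IDs in the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "transcript": [
        "0a1b2c3d-0000-4000-8000-000000000000_ENSG_toy",  # made-up UUID-style ID
        "ENST00000000233.9_ENSG00000004059.10",           # Ensembl transcript ID
        "e2b7c1d0-0000-4000-8000-000000000000_ENSG_toy",  # made-up ID, lowercase 'e'
    ]
})

# Vectorized prefix test: True only for IDs that begin with "ENST".
is_annot = toy["transcript"].str.startswith("ENST")
toy["novel_or_annot"] = np.where(is_annot, "annot", "novel")
print(toy["novel_or_annot"].value_counts())
```

Matching the whole prefix avoids mislabeling any ID that happens to start with an uppercase `E` for another reason, and `value_counts()` gives the same per-class tally as the `groupby` call.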

In [23]:
flair_novel_transcript_quant_data = flair_quant_data[flair_quant_data.novel_or_annot=='novel']

In [24]:
flair_novel_transcript_quant_data.shape

(72273, 95)

#### Transpose the dataframe

In [25]:
novel_transcript_quant_transposed = flair_novel_transcript_quant_data.transpose().reset_index()

#### Replace the header

In [26]:
novel_transcript_quant_header = novel_transcript_quant_transposed.iloc[0] #grab the first row for the header
novel_transcript_quant_transposed = novel_transcript_quant_transposed[1:] #remove first row from dataset
novel_transcript_quant_transposed.columns = novel_transcript_quant_header #set the new header row

#### Rename first column

In [27]:
novel_transcript_quant_transposed = novel_transcript_quant_transposed.rename(columns = {"transcript":"sample_id"})

**Drop the last two rows (the `transcript_type` and `novel_or_annot` labels)**

In [28]:
novel_transcript_quant_transposed.drop(novel_transcript_quant_transposed.tail(2).index,inplace=True)

In [29]:
novel_transcript_quant_transposed.shape

(92, 72274)
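
The transpose, header replacement, rename, and row drop above can also be written as one chain. The sketch below runs on a toy frame with made-up values. It drops the two label columns before transposing, which has the side effect of keeping the TPM values as floats: transposing a frame that still contains string columns turns every resulting column into `object` dtype.

```python
import pandas as pd

# Toy stand-in for the novel-transcript frame (made-up values).
toy = pd.DataFrame({
    "transcript": ["tx_a", "tx_b"],
    "S1": [0.0, 1.5],
    "S2": [2.0, 0.0],
    "transcript_type": ["0", "0"],
    "novel_or_annot": ["novel", "novel"],
})

transposed = (
    toy.drop(columns=["transcript_type", "novel_or_annot"])  # remove label columns first
       .set_index("transcript")  # transcript IDs become the column names after .T
       .T                        # rows = samples, columns = transcripts
)
transposed.index.name = "sample_id"  # name the index so reset_index creates sample_id
transposed.columns.name = None       # clear the leftover "transcript" axis label
transposed = transposed.reset_index()

print(transposed.dtypes)
```

If the object-dtype frames from the original cells are kept instead, the value columns would likely need a conversion such as `pd.to_numeric` before any numeric analysis.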

**Save the dataframe**

In [30]:
#novel_transcript_quant_transposed.to_csv('gtex_v9_data\\data_for_analysis\\gtex_database_data\\novel_transcript_quant_nofilter_transposed.csv', sep=',')

### (2) Dataframe of Annotated Transcripts

In [31]:
flair_annotated_transcript_quant_data = flair_quant_data[flair_quant_data.novel_or_annot=='annot']

In [32]:
flair_annotated_transcript_quant_data.shape

(21357, 95)

#### Transpose the dataframe

In [33]:
annotated_transcript_quant_transposed = flair_annotated_transcript_quant_data.transpose().reset_index()

#### Replace the header

In [34]:
annotated_transcript_quant_header = annotated_transcript_quant_transposed.iloc[0] #grab the first row for the header
annotated_transcript_quant_transposed = annotated_transcript_quant_transposed[1:] #remove first row from dataset
annotated_transcript_quant_transposed.columns = annotated_transcript_quant_header #set the new header row
annotated_transcript_quant_transposed.shape

(94, 21358)

#### Rename first column

In [35]:
annotated_transcript_quant_transposed = annotated_transcript_quant_transposed.rename(columns = {"transcript":"sample_id"})
annotated_transcript_quant_transposed.tail(3)

Unnamed: 0,sample_id,ENST00000000233.9_ENSG00000004059.10,ENST00000000412.7_ENSG00000003056.7,ENST00000001008.5_ENSG00000004478.7,ENST00000001146.6_ENSG00000003137.8,ENST00000002125.8_ENSG00000003509.15,ENST00000002165.10_ENSG00000001036.13,ENST00000002501.10_ENSG00000003249.13,ENST00000002596.5_ENSG00000002587.9,ENST00000003100.12_ENSG00000001630.15,...,ENST00000640621.1_ENSG00000262633.2,ENST00000640621.1-1_ENSG00000262633.2,ENST00000640674.1_ENSG00000278175.3,ENST00000640752.1_ENSG00000138796.16,ENST00000640769.1_ENSG00000176225.13,ENST00000640799.1_ENSG00000143612.19,ENST00000640815.1_ENSG00000164199.16,ENST00000640876.1_ENSG00000197563.10,ENST00000640893.1_ENSG00000087258.14,ENST00000640967.1_ENSG00000082212.12
92,GTEX-WY7C-0008-SM-3NZB5_exp,136.230313,110.811246,7.943458,0.397173,3.177383,103.26496,0.0,0.0,18.667127,...,18.667127,1.985865,0.0,0.0,5.957594,0.397173,0.0,0.794346,0.0,0.0
93,transcript_type,E,E,E,E,E,E,E,E,E,...,E,E,E,E,E,E,E,E,E,E
94,novel_or_annot,annot,annot,annot,annot,annot,annot,annot,annot,annot,...,annot,annot,annot,annot,annot,annot,annot,annot,annot,annot


**Drop the last two rows (the `transcript_type` and `novel_or_annot` labels)**

In [36]:
annotated_transcript_quant_transposed.drop(annotated_transcript_quant_transposed.tail(2).index,inplace=True)

In [37]:
annotated_transcript_quant_transposed.shape

(92, 21358)
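
With the first column named `sample_id`, a transposed frame can be joined to the sample metadata, for example to attach tissue labels. The sketch below uses toy tables with made-up IDs. It assumes the `sample_id` values in both tables use the same naming, which I have not checked for the real data, so it also flags rows that find no match.

```python
import pandas as pd

# Toy stand-ins (made-up IDs and values).
toy_quant_t = pd.DataFrame({
    "sample_id": ["S1", "S2", "S4"],
    "tx_a": [0.0, 2.0, 1.0],
})
toy_info = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "tissue": ["Liver", "Lung", "Liver"],
})

merged = toy_quant_t.merge(
    toy_info[["sample_id", "tissue"]],
    on="sample_id",
    how="left",             # keep every quantified sample
    validate="one_to_one",  # raise if a sample_id is duplicated on either side
    indicator=True,         # add a _merge column recording the match status
)

# Quantified samples that found no metadata row.
unmatched = merged.loc[merged["_merge"] == "left_only", "sample_id"].tolist()
print(unmatched)
```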

**Save the dataframe**

In [38]:
#annotated_transcript_quant_transposed.to_csv('gtex_v9_data\\data_for_analysis\\gtex_database_data\\annotated_transcript_quant_nofilter_transposed.csv', sep=',')