# GTEx Expressed and Tested Genes

## Housekeeping

### Background

* This notebook identifies tested genes from the ```GTEx_Analysis_v7_eQTL_expression_matrices.tar.gz``` data files rather than by re-applying the published expression thresholds ("```Genes were selected based on expression thresholds of >0.1 TPM in at least 20% of samples and ≥6 reads in at least 20% of samples.```"), because some genes that were included for testing do not seem to meet those criteria, e.g. ```ENSG00000259003``` and ```ENSG00000161583```
* It exports a list of genes tested for eQTLs in at least one tissue, with the number of tissues in which each gene was tested, and a list of tested genes for each tissue
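
For reference, the quoted thresholds could be checked against raw expression data roughly as follows. This is only a sketch: the function name, the `tpm` and `reads` tables (genes as rows, samples as columns) and the toy values are assumptions for illustration, not part of the GTEx pipeline or of this notebook.

```python
import pandas as pd

def passes_expression_thresholds(tpm, reads, min_tpm=0.1, min_reads=6, min_fraction=0.2):
    """Flag genes that meet both quoted criteria.

    tpm, reads: hypothetical genes x samples DataFrames (same index and columns).
    Returns a boolean Series indexed by gene.
    """
    # Fraction of samples with TPM strictly above the threshold (>0.1 TPM).
    tpm_fraction = (tpm > min_tpm).mean(axis=1)
    # Fraction of samples with at least the minimum read count (>=6 reads).
    reads_fraction = (reads >= min_reads).mean(axis=1)
    return (tpm_fraction >= min_fraction) & (reads_fraction >= min_fraction)

# Toy data (made up): geneA passes in 1 of 4 samples (25%), geneB in none.
samples = ['s1', 's2', 's3', 's4']
tpm = pd.DataFrame([[0.5, 0.0, 0.0, 0.0],
                    [0.05, 0.0, 0.0, 0.0]], index=['geneA', 'geneB'], columns=samples)
reads = pd.DataFrame([[10, 0, 0, 0],
                      [2, 0, 0, 0]], index=['geneA', 'geneB'], columns=samples)

passed = passes_expression_thresholds(tpm, reads)
```

Comparing such a mask with the gene lists in the expression matrices would show which tested genes fall outside the stated criteria.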

### Imports

In [1]:
import pandas as pd
import mysql.connector
from sqlalchemy import create_engine

In [2]:
engine = create_engine('mysql+mysqlconnector://jupyter:password@localhost:3306/gtex', echo=False)

### Functions

In [3]:
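import re
def removeGeneIDVersions(text):
    # Keep only the stable Ensembl gene ID, dropping any version suffix.
    # Raw string avoids an invalid-escape warning for \d.
    return re.findall(r'(ENSG\d+)', text)[0]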
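
As an illustration of what the function above does, the sketch below strips a version suffix from a gene ID and, unlike the original, returns `None` instead of raising `IndexError` when no Ensembl gene ID is present. The name `strip_gene_id_version` and the `.1` suffix in the example are assumptions made for the demonstration.

```python
import re

# Same pattern as in removeGeneIDVersions, compiled once.
ENSG_PATTERN = re.compile(r'(ENSG\d+)')

def strip_gene_id_version(text):
    """Return the unversioned Ensembl gene ID found in text, or None if absent."""
    match = ENSG_PATTERN.search(text)
    return match.group(1) if match else None

# Hypothetical versioned ID: the '.1' suffix is made up for this example.
stripped = strip_gene_id_version('ENSG00000259003.1')
missing = strip_gene_id_version('not_a_gene_id')
```

Returning `None` is a design choice for the sketch; raising on unexpected IDs, as the notebook's function does, may be preferable when a malformed input should stop the run.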

### Input/Output files

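* Input files
    * ```../../datasets/GTEx_Analysis_v7_eQTL_expression_matrices/[TISSUE].v7.normalized_expression.bed.gz```
* Output files
    * ```../../outputFiles/GTExV7/genesTestedWithNumberOfTissues.csv```
    * ```../../outputFiles/GTExV7/GTExTestedGenes/[TISSUE].txt```
    * ```../../outputFiles/GTExV7/genesTestedWithNumberOfMergedTissues.csv```
    * ```../../outputFiles/GTExV7/GTExMergedTissuesTestedGenes/[TISSUE].txt```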

---

## Analysis

### Export list of genes with number of tissues that meet expression criteria for eQTL testing

#### Get list of tissues

In [4]:
tissues = pd.read_sql_query(
    'SELECT DISTINCT tissue FROM `v7`',
    engine,
    coerce_float=True
)['tissue'].tolist()
tissues

['Adipose - Subcutaneous',
 'Adipose - Visceral (Omentum)',
 'Adrenal Gland',
 'Artery - Aorta',
 'Artery - Coronary',
 'Artery - Tibial',
 'Brain - Amygdala',
 'Brain - Anterior cingulate cortex (BA24)',
 'Brain - Caudate (basal ganglia)',
 'Brain - Cerebellar Hemisphere',
 'Brain - Cerebellum',
 'Brain - Cortex',
 'Brain - Frontal Cortex (BA9)',
 'Brain - Hippocampus',
 'Brain - Hypothalamus',
 'Brain - Nucleus accumbens (basal ganglia)',
 'Brain - Putamen (basal ganglia)',
 'Brain - Spinal cord (cervical c-1)',
 'Brain - Substantia nigra',
 'Breast - Mammary Tissue',
 'Cells - EBV-transformed lymphocytes',
 'Cells - Transformed fibroblasts',
 'Colon - Sigmoid',
 'Colon - Transverse',
 'Esophagus - Gastroesophageal Junction',
 'Esophagus - Mucosa',
 'Esophagus - Muscularis',
 'Heart - Atrial Appendage',
 'Heart - Left Ventricle',
 'Liver',
 'Lung',
 'Minor Salivary Gland',
 'Muscle - Skeletal',
 'Nerve - Tibial',
 'Ovary',
 'Pancreas',
 'Pituitary',
 'Prostate',
 'Skin - Not Sun Expo

#### Tested genes in all tissues

##### Get the list of genes from each expression matrix and export one list per tissue to `../../outputFiles/GTExV7/GTExTestedGenes/`

In [None]:
!mkdir -p ../../outputFiles/GTExV7/GTExTestedGenes

In [5]:
testedGenes = pd.DataFrame()
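for tissue in tissues:
    # Build the file name from the tissue label,
    # e.g. 'Adipose - Visceral (Omentum)' -> 'Adipose_Visceral_Omentum'
    tissueFile = tissue.replace(" - ", "_").replace(" ", "_").replace("(", "").replace(")", "")
    testedGenesInTissue = pd.read_csv('../../datasets/GTEx_Analysis_v7_eQTL_expression_matrices/'+tissueFile+'.v7.normalized_expression.bed.gz',
                sep='\t',
                dtype={0: 'str'})['gene_id']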
    
    # save tissue-specific list
    testedGenesInTissue.apply(removeGeneIDVersions).to_csv('../../outputFiles/GTExV7/GTExTestedGenes/'+tissue+'.txt', index=False)
    
    # add to pooled list
    testedGenes = pd.concat([testedGenes, testedGenesInTissue])
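
The tissue-to-file-name conversion used in the loop above can be isolated into a small helper, which makes the mapping easy to check by hand. The sketch below mirrors the same chain of `replace` calls; the names `tissue_to_file_stem` and `expression_matrix_path` are hypothetical.

```python
def tissue_to_file_stem(tissue):
    """Convert a tissue label into the stem of its expression matrix file name."""
    return tissue.replace(' - ', '_').replace(' ', '_').replace('(', '').replace(')', '')

def expression_matrix_path(tissue, base='../../datasets/GTEx_Analysis_v7_eQTL_expression_matrices/'):
    """Build the path to the normalized expression file for one tissue."""
    return base + tissue_to_file_stem(tissue) + '.v7.normalized_expression.bed.gz'

stems = [tissue_to_file_stem(t) for t in [
    'Adipose - Visceral (Omentum)',
    'Brain - Spinal cord (cervical c-1)',
    'Liver',
]]
# stems == ['Adipose_Visceral_Omentum', 'Brain_Spinal_cord_cervical_c-1', 'Liver']
```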

##### Count number of times each gene is listed to get number of expressed and tested tissues

In [6]:
# Name the counts explicitly so the column label does not depend on the pandas version
geneTissueExpressionCounts = testedGenes[0].value_counts().rename('expressedTissues').to_frame()
geneTissueExpressionCounts['Ensembl Gene ID'] = geneTissueExpressionCounts.index
geneTissueExpressionCounts.reset_index(inplace=True, drop=True)
geneTissueExpressionCounts['Ensembl Gene ID'] = geneTissueExpressionCounts['Ensembl Gene ID'].apply(removeGeneIDVersions)
geneTissueExpressionCounts

| | expressedTissues | Ensembl Gene ID |
|---|---|---|
| 0 | 48 | ENSG00000272186 |
| 1 | 48 | ENSG00000117748 |
| 2 | 48 | ENSG00000130856 |
| 3 | 48 | ENSG00000169446 |
| 4 | 48 | ENSG00000013573 |
| 5 | 48 | ENSG00000255513 |
| 6 | 48 | ENSG00000230510 |
| 7 | 48 | ENSG00000173175 |
| 8 | 48 | ENSG00000185220 |
| 9 | 48 | ENSG00000186019 |
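
The counting step works because each tissue contributes a gene at most once, so the number of occurrences in the pooled list equals the number of tissues in which the gene was tested. A minimal sketch on toy data follows; the tissue names and gene IDs are made up, and `rename_axis` with `reset_index(name=...)` is used here as one compact alternative to the steps in the cell above.

```python
import pandas as pd

# Hypothetical per-tissue gene lists (IDs are placeholders, not real genes).
per_tissue = {
    'Tissue A': ['ENSG00000000001', 'ENSG00000000002'],
    'Tissue B': ['ENSG00000000001'],
    'Tissue C': ['ENSG00000000001', 'ENSG00000000002', 'ENSG00000000003'],
}

# Pool all lists, then count how often each gene appears.
pooled = pd.concat([pd.Series(genes) for genes in per_tissue.values()], ignore_index=True)
counts = (pooled.value_counts()
          .rename_axis('Ensembl Gene ID')
          .reset_index(name='expressedTissues'))

# Convert to a plain dict for easy inspection.
counts_by_gene = dict(zip(counts['Ensembl Gene ID'], counts['expressedTissues']))
```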


##### Save counts to file

In [7]:
geneTissueExpressionCounts.to_csv('../../outputFiles/GTExV7/genesTestedWithNumberOfTissues.csv', index=False)

#### Merged tissues

Merge related GTEx tissues, such as `Adipose - Subcutaneous` and `Adipose - Visceral (Omentum)`, into a single tissue (here `Adipose`) whose gene list is the unique set of genes tested in either `Subcutaneous` or `Visceral`.

In [8]:
tissuesMerged = [['Adipose', ['Adipose - Subcutaneous',
                              'Adipose - Visceral (Omentum)']],
 ['Adrenal Gland', ['Adrenal Gland']],
 ['Artery', ['Artery - Aorta',
             'Artery - Coronary',
             'Artery - Tibial']],
 ['Brain', ['Brain - Amygdala',
            'Brain - Anterior cingulate cortex (BA24)',
            'Brain - Caudate (basal ganglia)',
            'Brain - Cerebellar Hemisphere',
            'Brain - Cerebellum',
            'Brain - Cortex',
            'Brain - Frontal Cortex (BA9)',
            'Brain - Hippocampus',
            'Brain - Hypothalamus',
            'Brain - Nucleus accumbens (basal ganglia)',
            'Brain - Putamen (basal ganglia)',
            'Brain - Spinal cord (cervical c-1)',
            'Brain - Substantia nigra']],
 ['Breast - Mammary Tissue', ['Breast - Mammary Tissue']],
 ['Colon', ['Colon - Sigmoid',
            'Colon - Transverse']],
 ['Esophagus', ['Esophagus - Gastroesophageal Junction',
                'Esophagus - Mucosa',
                'Esophagus - Muscularis']],
 ['Heart', ['Heart - Atrial Appendage',
            'Heart - Left Ventricle']],
 ['Liver', ['Liver']],
 ['Lung', ['Lung']],
 ['Minor Salivary Gland', ['Minor Salivary Gland']],
 ['Muscle - Skeletal', ['Muscle - Skeletal']],
 ['Nerve - Tibial', ['Nerve - Tibial']],
 ['Ovary', ['Ovary']],
 ['Pancreas', ['Pancreas']],
 ['Pituitary', ['Pituitary']],
 ['Prostate', ['Prostate']],
 ['Skin', ['Skin - Not Sun Exposed (Suprapubic)',
           'Skin - Sun Exposed (Lower leg)']],
 ['Small Intestine - Terminal Ileum', ['Small Intestine - Terminal Ileum']],
 ['Spleen', ['Spleen']],
 ['Stomach', ['Stomach']],
 ['Testis', ['Testis']],
 ['Thyroid', ['Thyroid']],
 ['Uterus', ['Uterus']],
 ['Vagina', ['Vagina']],
 ['Whole Blood', ['Whole Blood']]]

In [None]:
!mkdir -p ../../outputFiles/GTExV7/GTExMergedTissuesTestedGenes

In [9]:
testedGenes = pd.DataFrame()
for merged in tissuesMerged:
    
    combined = []
    
    for tissue in merged[1]:
        # Build the file name from the tissue label,
        # e.g. 'Adipose - Visceral (Omentum)' -> 'Adipose_Visceral_Omentum'
        tissueFile = tissue.replace(" - ", "_").replace(" ", "_").replace("(", "").replace(")", "")
        df = pd.read_csv('../../datasets/GTEx_Analysis_v7_eQTL_expression_matrices/'+tissueFile+'.v7.normalized_expression.bed.gz',
                sep='\t',
                dtype={0: 'str'})['gene_id']
        combined.append(df)
        
    testedGenesInTissue = pd.concat(combined, axis=0, ignore_index=True)
    testedGenesInTissue = pd.Series(testedGenesInTissue.unique())
    
    # save tissue-specific list
    testedGenesInTissue.apply(removeGeneIDVersions).to_csv('../../outputFiles/GTExV7/GTExMergedTissuesTestedGenes/'+merged[0]+'.txt', index=False)
    
    # add to pooled list
    testedGenes = pd.concat([testedGenes, testedGenesInTissue])
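
The merge in the loop above amounts to a set union over the sub-tissues of each group. The sketch below shows the same idea on toy data with plain Python sets; the function name, tissue labels and gene IDs are assumptions. Note that it sorts the merged list, whereas the notebook keeps genes in order of first occurrence.

```python
def merge_tissue_gene_lists(genes_by_tissue, groups):
    """Return {merged tissue name: sorted unique genes tested in any member tissue}."""
    merged = {}
    for merged_name, members in groups:
        union = set()
        for tissue in members:
            union.update(genes_by_tissue[tissue])
        merged[merged_name] = sorted(union)
    return merged

# Hypothetical gene lists per tissue (IDs are placeholders).
genes_by_tissue = {
    'Adipose - Subcutaneous': ['ENSG00000000001', 'ENSG00000000002'],
    'Adipose - Visceral (Omentum)': ['ENSG00000000002', 'ENSG00000000003'],
    'Liver': ['ENSG00000000001'],
}
# Same [name, [members]] layout as tissuesMerged.
groups = [
    ['Adipose', ['Adipose - Subcutaneous', 'Adipose - Visceral (Omentum)']],
    ['Liver', ['Liver']],
]

merged_genes = merge_tissue_gene_lists(genes_by_tissue, groups)
```

Because a gene shared by both adipose depots appears only once in the merged list, the later per-gene counts reflect merged tissues rather than individual GTEx tissues.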

##### Count number of times each gene is listed to get number of expressed and tested tissues

In [10]:
# Name the counts explicitly so the column label does not depend on the pandas version
geneTissueExpressionCounts = testedGenes[0].value_counts().rename('expressedTissues').to_frame()
geneTissueExpressionCounts['Ensembl Gene ID'] = geneTissueExpressionCounts.index
geneTissueExpressionCounts.reset_index(inplace=True, drop=True)
geneTissueExpressionCounts['Ensembl Gene ID'] = geneTissueExpressionCounts['Ensembl Gene ID'].apply(removeGeneIDVersions)
geneTissueExpressionCounts

| | expressedTissues | Ensembl Gene ID |
|---|---|---|
| 0 | 26 | ENSG00000170604 |
| 1 | 26 | ENSG00000178458 |
| 2 | 26 | ENSG00000141425 |
| 3 | 26 | ENSG00000117115 |
| 4 | 26 | ENSG00000105204 |
| 5 | 26 | ENSG00000131115 |
| 6 | 26 | ENSG00000170647 |
| 7 | 26 | ENSG00000121390 |
| 8 | 26 | ENSG00000197948 |
| 9 | 26 | ENSG00000135404 |


##### Save counts to file

In [11]:
geneTissueExpressionCounts.to_csv('../../outputFiles/GTExV7/genesTestedWithNumberOfMergedTissues.csv', index=False)