In [1]:
import pandas as pd
import numpy as np
import scipy as sc
import scipy.stats  # load the stats submodule explicitly so that sc.stats is available

from statsmodels.stats.multitest import multipletests

## Data: gene expression in white blood cells

We have gene expression data (gene expression is the level of activity of a gene) from white blood cells of children with severe therapy-resistant asthma and of healthy controls. The results provide insight into the molecular pathogenesis of asthma. The data were collected with DNA microarrays: https://en.wikipedia.org/wiki/DNA_microarray

We would like to find out which genes differ in average activity level between the healthy and asthma groups.

Data source: http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4896

In [2]:
expression = pd.read_csv('expression.csv')

In [3]:
expression.head()

Unnamed: 0,ID_REF,IDENTIFIER,Healthy control,Healthy control.1,Healthy control.2,Healthy control.3,Healthy control.4,Healthy control.5,Healthy control.6,Healthy control.7,...,Severe asthma.8,Severe asthma.9,Severe asthma.10,Severe asthma.11,Severe asthma.12,Severe asthma.13,Severe asthma.14,Severe asthma.15,Severe asthma.16,Gene title
0,7933640,A1CF,4.24505,4.44464,4.36671,4.33497,4.45717,4.37972,4.40154,4.40521,...,4.59426,4.56012,4.60672,4.43026,4.55835,4.25656,4.52559,4.42976,4.37098,APOBEC1 complementation factor
1,7960947,A2M,4.79868,4.76833,4.66414,4.80272,4.85779,4.77622,4.92634,4.88576,...,5.01994,5.15406,5.13639,4.99052,4.88454,4.68904,4.85329,5.05764,5.48561,alpha-2-macroglobulin
2,7953775,A2ML1,4.79161,5.12633,4.9386,4.74597,4.91789,4.74453,5.23725,4.83903,...,5.22915,5.11131,5.33257,5.05544,4.96971,4.96505,4.76993,5.04489,4.70002,alpha-2-macroglobulin-like 1
3,8076497,A4GALT,5.79783,5.93942,5.82935,5.91139,5.60195,5.68317,6.01254,5.84921,...,6.16251,6.21829,6.06835,5.80154,6.06533,5.79955,5.94065,6.08192,5.86693,"alpha 1,4-galactosyltransferase"
4,8090955,A4GNT,3.79685,4.00154,3.83103,3.91021,3.8193,3.86203,3.9702,3.84166,...,4.17925,3.99698,4.09657,4.00232,3.70843,3.7929,4.31364,3.94925,3.62481,"alpha-1,4-N-acetylglucosaminyltransferase"


In [4]:
print("Number of genes in the study:", expression.shape[0])
healthy = [col for col in expression if col.startswith('Healthy')]
print("Number of healthy controls:", len(healthy))
asthma = [col for col in expression if col.startswith('Severe')]
print("Number of children with severe asthma:", len(asthma))

Number of genes in the study: 21465
Number of healthy controls: 18
Number of children with severe asthma: 17


In [5]:
expression['mean_healthy'] = expression[healthy].mean(axis=1)
expression['mean_asthma'] = expression[asthma].mean(axis=1)

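For each gene, we are going to compare average expression levels between healthy children and children with severe asthma using Welch's t-test, the version of the two-sample t-test that does not assume equal variances in the two groups: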

In [6]:
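def compare_groups(row):
    # Welch's t-test (equal_var=False): no assumption of equal variances in the two groups
    healthy_values = row[healthy].astype(float)
    asthma_values = row[asthma].astype(float)
    return sc.stats.ttest_ind(healthy_values, asthma_values, equal_var=False).pvalue

expression['p'] = expression.apply(compare_groups, axis=1)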
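To make the test statistic concrete, here is a minimal sketch of Welch's t-test written out from the textbook formulas and checked against `scipy.stats.ttest_ind`. The helper name `welch_t_test` and the synthetic samples are illustrative assumptions; they are not part of the original analysis and do not use `expression.csv`.

```python
import numpy as np
from scipy import stats


def welch_t_test(x, y):
    """Two-sided Welch's t-test computed from the textbook formulas (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = x.size, y.size
    v1, v2 = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances
    se2 = v1 / n1 + v2 / n2                # squared standard error of the mean difference
    t = (x.mean() - y.mean()) / np.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    p = 2 * stats.t.sf(abs(t), df)         # two-sided p-value
    return t, df, p


# Synthetic data (an assumption for illustration): 18 "controls" and 17 "cases"
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=0.2, size=18)
group_b = rng.normal(loc=5.3, scale=0.4, size=17)

t_manual, df_manual, p_manual = welch_t_test(group_a, group_b)
reference = stats.ttest_ind(group_a, group_b, equal_var=False)

print("t statistic:", t_manual, "vs scipy:", reference.statistic)
print("p-value:    ", p_manual, "vs scipy:", reference.pvalue)
```

The manual and library results should agree up to floating-point error. The approximate degrees of freedom always lie between the smaller group's size minus one and the pooled value `n1 + n2 - 2`.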

How many genes have significantly different average expression levels – without accounting for multiple hypothesis testing?

In [7]:
print('Number of significant differences, no correction for multiplicity:', sum(expression.p <= 0.05))

Number of significant differences, no correction for multiplicity: 2772
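To put this count in context: with 21465 tests at level 0.05, about 21465 × 0.05 ≈ 1073 rejections would be expected by chance alone even if no gene differed between the groups. The sketch below illustrates this with simulated data in which the null hypothesis holds for every gene. It assumes independent, normally distributed expression values, which real microarray data need not satisfy, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

n_genes, n_healthy, n_asthma, alpha = 21465, 18, 17, 0.05
print("Expected false positives if no gene differs:", n_genes * alpha)

# Simulated null data (an assumption for illustration): both groups are drawn
# from the same distribution, so every rejection is a false positive.
rng = np.random.default_rng(42)
null_healthy = rng.normal(size=(n_genes, n_healthy))
null_asthma = rng.normal(size=(n_genes, n_asthma))

# Welch's t-test for all genes at once (one test per row)
null_p = stats.ttest_ind(null_healthy, null_asthma, axis=1, equal_var=False).pvalue
n_false_positives = int((null_p <= alpha).sum())
print("Uncorrected rejections on pure-noise data:", n_false_positives)
```

The simulated count should land near the expected value, which is why an uncorrected list of significant genes cannot be taken at face value.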


## Holm's method

Let's correct for multiple hypothesis testing using Holm's method, which controls the family-wise error rate (FWER), the probability of making at least one false rejection:

In [8]:
_, expression['p_adjusted_holm'], _, _ = multipletests(expression.p, alpha = 0.05, method = 'holm') 
print('Number of significant differences with FWER <= 0.05:', sum(expression.p_adjusted_holm <= 0.05))
expression[expression.p_adjusted_holm <= 0.05][['IDENTIFIER', 'Gene title', 'mean_healthy', 'mean_asthma', 
                                                'p', 'p_adjusted_holm']]

Number of significant differences with FWER <= 0.05: 9


Unnamed: 0,IDENTIFIER,Gene title,mean_healthy,mean_asthma,p,p_adjusted_holm
431,AGPAT4-IT1,AGPAT4 intronic transcript 1 (non-protein coding),5.591794,5.906907,5.504217e-07,0.011812
2865,CD4,CD4 molecule,10.332761,9.994222,2.169653e-06,0.046554
7128,GPR21,G protein-coupled receptor 21,5.907257,6.853551,3.304167e-08,0.000709
7144,GPR52,G protein-coupled receptor 52,6.284526,7.202818,1.456452e-08,0.000313
10186,LPP,LIM domain containing preferred translocation ...,8.943248,9.276649,1.816116e-07,0.003897
12641,OCR1,ovarian cancer-related protein 1,7.043893,8.379184,9.935442e-08,0.002132
17245,SND1-IT1,SND1 intronic transcript 1 (non-protein coding),5.543953,6.252731,3.057326e-08,0.000656
18225,SYNE2,"spectrin repeat containing, nuclear envelope 2",8.603884,9.020421,1.323281e-06,0.028395
20763,ZEB2,zinc finger E-box binding homeobox 2,6.341786,7.358729,3.958029e-08,0.000849


Just for comparison – here's what we would get using Bonferroni's correction:

In [9]:
_, expression['p_adjusted_bonf'], _, _ = multipletests(expression.p, alpha = 0.05, method = 'bonferroni') 
expression[expression.p_adjusted_bonf <= 0.05][['IDENTIFIER', 'Gene title', 'mean_healthy', 'mean_asthma', 
                                                'p', 'p_adjusted_bonf']]

Unnamed: 0,IDENTIFIER,Gene title,mean_healthy,mean_asthma,p,p_adjusted_bonf
431,AGPAT4-IT1,AGPAT4 intronic transcript 1 (non-protein coding),5.591794,5.906907,5.504217e-07,0.011815
2865,CD4,CD4 molecule,10.332761,9.994222,2.169653e-06,0.046572
7128,GPR21,G protein-coupled receptor 21,5.907257,6.853551,3.304167e-08,0.000709
7144,GPR52,G protein-coupled receptor 52,6.284526,7.202818,1.456452e-08,0.000313
10186,LPP,LIM domain containing preferred translocation ...,8.943248,9.276649,1.816116e-07,0.003898
12641,OCR1,ovarian cancer-related protein 1,7.043893,8.379184,9.935442e-08,0.002133
17245,SND1-IT1,SND1 intronic transcript 1 (non-protein coding),5.543953,6.252731,3.057326e-08,0.000656
18225,SYNE2,"spectrin repeat containing, nuclear envelope 2",8.603884,9.020421,1.323281e-06,0.028404
20763,ZEB2,zinc finger E-box binding homeobox 2,6.341786,7.358729,3.958029e-08,0.00085


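It's the same 9 genes! With 21465 hypotheses, Holm's thresholds for the smallest p-values (0.05/21465, 0.05/21464, ...) are almost identical to Bonferroni's 0.05/21465, so even the more powerful Holm's method cannot reject more than Bonferroni's. Controlling FWER is probably too strict here.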
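The near-identity of the two corrections can be checked by hand. The sketch below recomputes both adjustments for the nine smallest p-values, copied from the tables above as displayed (so they are rounded), using only NumPy. Bonferroni multiplies every p-value by the number of tests m, while Holm multiplies the k-th smallest by m − k + 1 and then enforces monotonicity.

```python
import numpy as np

m = 21465  # number of hypotheses (genes) in the study

# Nine smallest raw p-values, in increasing order, as displayed in the tables above
p_smallest = np.array([
    1.456452e-08,  # GPR52
    3.057326e-08,  # SND1-IT1
    3.304167e-08,  # GPR21
    3.958029e-08,  # ZEB2
    9.935442e-08,  # OCR1
    1.816116e-07,  # LPP
    5.504217e-07,  # AGPAT4-IT1
    1.323281e-06,  # SYNE2
    2.169653e-06,  # CD4
])

ranks = np.arange(1, p_smallest.size + 1)

# Bonferroni: every p-value is multiplied by m
bonferroni = np.minimum(1.0, m * p_smallest)

# Holm (step-down): the k-th smallest p-value is multiplied by (m - k + 1),
# then a running maximum keeps the adjusted values non-decreasing
holm = np.minimum(1.0, np.maximum.accumulate((m - ranks + 1) * p_smallest))

for k, b, h in zip(ranks, bonferroni, holm):
    print(f"rank {k}: Bonferroni {b:.6f}  Holm {h:.6f}  ratio {h / b:.6f}")
```

For the ninth gene the Holm multiplier is 21457 instead of 21465, a difference of less than 0.04%, so the two methods can only disagree on a p-value that sits almost exactly on the threshold.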

## Benjamini-Hochberg's method
DNA microarrays are an exploratory tool – they are used to generate scientific hypotheses that could later be tested with more precise instruments. It might make sense to tolerate some type I errors in exchange for higher power.

Let's correct for multiple hypothesis testing using Benjamini-Hochberg's method, which controls the false discovery rate (FDR, the expected proportion of false rejections among all rejections), as is common in microarray analysis:

In [10]:
_, expression['p_adjusted_bh'], _, _ = multipletests(expression.p, alpha = 0.05, method = 'fdr_bh') 
print('Number of significant differences with FDR <= 0.05:', sum(expression.p_adjusted_bh <= 0.05))
with pd.option_context('display.max_rows', None):
    display(expression[expression.p_adjusted_bh <= 0.05][['IDENTIFIER', 'Gene title', 'mean_healthy', 
                                                          'mean_asthma', 'p', 'p_adjusted_bh']])

Number of significant differences with FDR <= 0.05: 168


Unnamed: 0,IDENTIFIER,Gene title,mean_healthy,mean_asthma,p,p_adjusted_bh
18,AAR2,AAR2 splicing factor homolog (S. cerevisiae),8.176038,7.995185,0.0002287793,0.041617
135,ACBD3,acyl-CoA binding domain containing 3,9.503262,9.24992,6.883657e-06,0.008859
261,ADAM20,ADAM metallopeptidase domain 20,5.073112,5.413564,0.0002536561,0.043725
315,ADCK2,aarF domain containing kinase 2,6.407213,6.209225,0.0003493704,0.048863
431,AGPAT4-IT1,AGPAT4 intronic transcript 1 (non-protein coding),5.591794,5.906907,5.504217e-07,0.001688
582,ALKBH5,"AlkB family member 5, RNA demethylase",9.254869,9.035186,3.88782e-05,0.018019
662,ANGPTL1,angiopoietin-like 1,3.591168,3.921505,0.0001794481,0.039998
738,ANKRD36,ankyrin repeat domain 36///ankyrin repeat doma...,8.897997,9.319184,3.227949e-05,0.016113
777,ANP32A-IT1,ANP32A intronic transcript 1 (non-protein coding),6.779091,7.190945,0.0002939759,0.045086
854,APH1A,APH1A gamma secretase subunit,8.620867,8.342951,2.096931e-05,0.01439


Much better! This list of genes likely contains some false findings (with FDR controlled at 0.05, at most about 5% of them in expectation), but by allowing that we were able to find many more differences, some of which might be interesting.
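For reference, the Benjamini-Hochberg adjustment used above can be sketched in a few lines of NumPy. The function name `benjamini_hochberg` and the toy p-values are illustrative assumptions; this is a simplified re-implementation for understanding, not the code that `multipletests` runs internally.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (illustrative re-implementation)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices that sort p in increasing order
    scaled = p[order] * m / np.arange(1, m + 1)  # k-th smallest p-value times m / k
    # Enforce monotonicity: running minimum taken from the largest p-value downwards
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(1.0, adjusted_sorted)  # back to the original order
    return adjusted


# Toy p-values (made up for illustration)
toy_p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205])
toy_adjusted = benjamini_hochberg(toy_p)

print("BH-adjusted p-values:", np.round(toy_adjusted, 4))
print("Rejected at FDR <= 0.05:", int((toy_adjusted <= 0.05).sum()))
print("Rejected by Bonferroni at 0.05:", int((toy_p * toy_p.size <= 0.05).sum()))
```

The same arithmetic reproduces the table above: AGPAT4-IT1 has the 7th smallest p-value, and 5.504217e-07 × 21465 / 7 ≈ 0.001688. Because the k-th smallest p-value is multiplied by m / k rather than by roughly m, the correction relaxes quickly as the rank grows, which is where the extra rejections come from.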