## Detection of statistically significant differences in gene expression levels of cancer patients

The significance level is $\alpha = 0.05$ by default.

About the data:

There are 72 people:

       24 without breast cancer (diagnosis is normal)
       25 with breast cancer diagnosed at an early stage (diagnosis is early neoplasia)
       23 with manifest symptoms of breast cancer (diagnosis is cancer)

Expression levels were measured for 15748 genes.
       
**Task:** detect genes whose activity differs significantly between people at different stages of the disease

In [1]:
import pandas as pd
import numpy as np

from scipy import stats

In [30]:
CNT_GENES = 15748
ALPHA = 0.05

In [24]:
df = pd.read_csv('gene_high_throughput_sequencing.csv')
df.head()

Unnamed: 0,Patient_id,Diagnosis,LOC643837,LOC100130417,SAMD11,NOC2L,KLHL17,PLEKHN1,C1orf170,HES4,...,CLIC2,RPS4Y1,ZFY,PRKY,USP9Y,DDX3Y,CD24,CYorf15B,KDM5D,EIF1AY
0,STT5425_Breast_001_normal,normal,1.257614,2.408148,13.368622,9.494779,20.880435,12.722017,9.494779,54.349694,...,4.76125,1.257614,1.257614,1.257614,1.257614,1.257614,23.268694,1.257614,1.257614,1.257614
1,STT5427_Breast_023_normal,normal,4.567931,16.602734,42.477752,25.562376,23.221137,11.622386,14.330573,72.445474,...,6.871902,1.815112,1.815112,1.815112,1.815112,1.815112,10.427023,1.815112,1.815112,1.815112
2,STT5430_Breast_002_normal,normal,2.077597,3.978294,12.863214,13.728915,14.543176,14.141907,6.23279,57.011005,...,7.096343,2.077597,2.077597,2.077597,2.077597,2.077597,22.344226,2.077597,2.077597,2.077597
3,STT5439_Breast_003_normal,normal,2.066576,8.520713,14.466035,7.823932,8.520713,2.066576,10.870009,53.292034,...,5.20077,2.066576,2.066576,2.066576,2.066576,2.066576,49.295538,2.066576,2.066576,2.066576
4,STT5441_Breast_004_normal,normal,2.613616,3.434965,12.682222,10.543189,26.688686,12.484822,1.364917,67.140393,...,11.22777,1.364917,1.364917,1.364917,1.364917,1.364917,23.627911,1.364917,1.364917,1.364917


### 1
**t-test**

In [98]:
genes = df.columns[2:].tolist()
print(len(genes))

15748


In [29]:
d1, d2 = {}, {}
for gen in genes:
    a = df.loc[df['Diagnosis']=='normal', gen] 
    b = df.loc[df['Diagnosis']=='early neoplasia', gen]
    c = df.loc[df['Diagnosis']=='cancer', gen]
    
    # p-value of Welch's t-test (unequal variances) for two independent samples
    d1[gen] = stats.ttest_ind(a, b, equal_var=False)[1]
    d2[gen] = stats.ttest_ind(b, c, equal_var=False)[1]
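
To make explicit what `equal_var=False` asks for, here is a minimal sketch of the Welch statistic and the Welch-Satterthwaite degrees of freedom, using only the standard library. The helper name `welch_t` and the sample values are made up for illustration and are not part of the notebook or the dataset; the p-value itself still needs the t distribution, which `stats.ttest_ind` provides.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    # squared standard errors of the two sample means (unbiased variances)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    dof = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, dof


# Hypothetical expression values of one gene in two groups (illustrative only)
normal = [9.5, 13.7, 7.8, 10.5, 11.2]
early = [25.6, 14.3, 19.9, 22.4, 17.8]

t_stat, dof = welch_t(normal, early)
print(round(t_stat, 3), round(dof, 2))
```

The degrees of freedom always lie between the smaller of `n - 1` over the two groups and `nx + ny - 2`, which is why the test stays valid when the group variances differ.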


In [32]:
s1, s2 = 0, 0
for v in d1.values():
    if v < ALPHA:
        s1 += 1
for v in d2.values():
    if v < ALPHA:
        s2 += 1
print(s1, s2)

1575 3490


### 2

**Holm and Bonferroni corrections**

FWER - familywise error rate

The Holm method corrects for testing many genes within each of the 2 comparisons (normal as control vs. early neoplasia as treatment, and early neoplasia as control vs. cancer as treatment).

The Bonferroni correction accounts for running 2 comparisons (normal vs. early neoplasia, and early neoplasia vs. cancer), so the new level is $\alpha' = \alpha / 2$.
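
The Holm step-down procedure can be sketched by hand as follows. This is a textbook-style illustration, not the statsmodels implementation; the function name `holm_reject` and the p-values are hypothetical. The level passed in is already halved, as described above.

```python
def holm_reject(p_values, alpha):
    """Return a list of booleans: True where Holm's step-down procedure rejects."""
    m = len(p_values)
    # indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # the k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection; all larger p-values are kept
    return reject


ALPHA = 0.05
alpha_per_comparison = ALPHA / 2  # Bonferroni over the 2 group comparisons

p_values = [0.001, 0.04, 0.012, 0.3, 0.004]  # made-up p-values for 5 genes
decisions = holm_reject(p_values, alpha_per_comparison)
print(decisions)
```

Here the two smallest p-values pass their thresholds (0.005 and 0.00625), while 0.012 exceeds 0.025 / 3, so the procedure stops there.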

In [138]:
def fold_change(c,t):
    '''Compute the fold change metric.
    This metric is used in bioinformatics to assess practical significance.
    
    Parameters:
        c - mean value in control group
        t - mean value in treatment group
        
    Return:
        fold change
    '''
    
    return t/c if t >= c else -c/t
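
A few worked values may help to read the sign convention. The sketch below repeats the function so that it runs on its own and assumes strictly positive group means, since a zero mean would cause a division by zero; the numbers are made up.

```python
def fold_change(c, t):
    """Symmetric fold change of the treatment mean t relative to the control mean c."""
    return t / c if t >= c else -c / t


# (control mean, treatment mean) pairs, illustrative only
examples = [(10.0, 20.0), (20.0, 10.0), (10.0, 12.0)]

for c, t in examples:
    fc = fold_change(c, t)
    # a gene counts as practically significant when |fold change| > 1.5
    print(c, t, fc, abs(fc) > 1.5)
```

A doubling gives 2.0 and a halving gives -2.0, so the magnitude is never below 1 and `abs(x) > 1.5` treats increases and decreases symmetrically.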

In [36]:
import statsmodels.stats.multitest as smm

In [145]:
reject1, p_corrected1, a11, a12 = smm.multipletests(list(d1.values()), alpha=ALPHA/2, method='holm')
reject2, p_corrected2, a21, a22 = smm.multipletests(list(d2.values()), alpha=ALPHA/2, method='holm')

In [146]:
print(sum(reject1), sum(reject2))

2 79


In [147]:
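# indices of rejected hypotheses; flatnonzero always returns a 1-D array,
# so the loops below also work when exactly one gene is rejected
ix1 = np.flatnonzero(reject1)
ix2 = np.flatnonzero(reject2)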

In [148]:
genes1, genes2 = [], []
for i in ix1:
    genes1.append(genes[i])
for i in ix2:
    genes2.append(genes[i])

In [149]:
f1, f2 = [], []
for gen in genes1:
    f1.append(fold_change(df.loc[df['Diagnosis']=='normal', gen].mean(), df.loc[df['Diagnosis']=='early neoplasia', gen].mean()))
for gen in genes2:
    f2.append(fold_change(df.loc[df['Diagnosis']=='early neoplasia', gen].mean(), df.loc[df['Diagnosis']=='cancer', gen].mean()))

In [150]:
# the 1.5 threshold is given in the task
print(len([x for x in f1 if abs(x) > 1.5]), len([x for x in f2 if abs(x) > 1.5]))

2 77


### 3

**Benjamini–Hochberg correction**

FDR - false discovery rate

The procedure is the same as in section 2, with the Benjamini–Hochberg method in place of Holm.
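
For reference, the Benjamini–Hochberg step-up rule can be sketched by hand. This is a textbook-style illustration, not the statsmodels implementation; the function name `benjamini_hochberg_reject` and the p-values are hypothetical, and the level 0.025 stands for $\alpha / 2$.

```python
def benjamini_hochberg_reject(p_values, alpha):
    """Return a list of booleans: True where the BH step-up procedure rejects."""
    m = len(p_values)
    # indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # find the largest rank k with p_(k) <= k / m * alpha
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True  # reject every hypothesis ranked at or below k_max
    return reject


p_values = [0.001, 0.04, 0.012, 0.3, 0.004]  # made-up p-values for 5 genes
decisions = benjamini_hochberg_reject(p_values, 0.025)
print(decisions)
```

Because the threshold grows with the rank, controlling the FDR is less strict than controlling the FWER and usually rejects more hypotheses.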

In [139]:
reject1, p_corrected1, a11, a12 = smm.multipletests(list(d1.values()), alpha=ALPHA/2, method='fdr_bh')
reject2, p_corrected2, a21, a22 = smm.multipletests(list(d2.values()), alpha=ALPHA/2, method='fdr_bh')

In [140]:
print(sum(reject1), sum(reject2))

4 832


In [141]:
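# indices of rejected hypotheses; flatnonzero always returns a 1-D array,
# so the loops below also work when exactly one gene is rejected
ix1 = np.flatnonzero(reject1)
ix2 = np.flatnonzero(reject2)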

In [142]:
genes1, genes2 = [], []
for i in ix1:
    genes1.append(genes[i])
for i in ix2:
    genes2.append(genes[i])

In [143]:
f1, f2 = [], []
for gen in genes1:
    f1.append(fold_change(df.loc[df['Diagnosis']=='normal', gen].mean(), df.loc[df['Diagnosis']=='early neoplasia', gen].mean()))
for gen in genes2:
    f2.append(fold_change(df.loc[df['Diagnosis']=='early neoplasia', gen].mean(), df.loc[df['Diagnosis']=='cancer', gen].mean()))

In [144]:
# the 1.5 threshold is given in the task
print(len([x for x in f1 if abs(x) > 1.5]), len([x for x in f2 if abs(x) > 1.5]))

4 524
