# Setup

In [1]:
import os
import re
import gzip

import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

from Bio import SeqIO

from IPython.display import display

DIR = r'c://downloads'

# Q1

According to [Wikipedia](https://en.wikipedia.org/wiki/PDZ_domain), PDZ domains play a key role in anchoring receptor proteins in the membrane to cytoskeletal components and in the formation and function of signal transduction complexes, including transport and ion channel signaling.

# Q2

To find keywords significantly associated with the PDZ domain, we carry out the analysis at the resolution of proteins (not domains), since proteins are the entities that are annotated with keywords. Furthermore, we will probably want to make statements about the functions of proteins rather than of individual domains, so we should analyze the entities about which we will later want to draw inferences.

Overall strategy:
1. Extract the UniProt IDs of all human proteins that contain a PDZ domain from Pfam.
2. Extract all proteins with their keywords from UniProt (for simplicity, focusing only on human proteins).
3. Run an independent Fisher's exact test for each keyword, and finally correct for multiple testing (see details in the comments below).

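Note that a single chi-squared test over one PDZ-by-keyword table cannot be used here for two reasons:
1. Keywords are NOT mutually exclusive (a protein usually carries several keywords, so it would be counted in several cells).
2. We want to know WHICH of the keywords are significant (not only that there are indeed some significant keywords).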
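To make the per-keyword test concrete, here is a minimal sketch of what a two-sided Fisher's exact test computes for one 2x2 table, followed by a Bonferroni correction over two tests. The function name `fisher_exact_by_hand` and the toy counts are hypothetical and made up for illustration; the analysis below uses `scipy.stats.fisher_exact` and `multipletests` instead.

```python
from math import comb

def fisher_exact_by_hand(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell, with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely as the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(p for p in map(table_probability, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-7))

# Made-up counts. Rows: pdz / no_pdz; columns: keyword / no_keyword.
pval_enriched = fisher_exact_by_hand(8, 2, 10, 80)   # 80% vs. 11% carry the keyword
pval_neutral = fisher_exact_by_hand(1, 9, 9, 81)     # 10% vs. 10% carry the keyword

# Bonferroni: multiply each p-value by the number of tests (capped at 1).
n_tests = 2
adjusted = [min(1.0, pval * n_tests) for pval in (pval_enriched, pval_neutral)]
print(pval_enriched, pval_neutral, adjusted)
```

Only the first toy keyword stays below 0.05 after correction; the second one has identical proportions in both groups, so there is nothing to detect.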

In [2]:
# From: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/proteomes/9606.tsv.gz
human_pfam_records = pd.read_csv(os.path.join(DIR, '9606.tsv.gz'), sep = r'\t|> <', skiprows = 2, na_values = ['No_clan'], \
        engine = 'python').rename(columns = lambda name: re.sub(r'[ \-]', '_', re.sub(r'[#\<\>]', '', name)).lower())
human_pfam_domains = human_pfam_records[human_pfam_records['type'] == 'Domain']
display(human_pfam_records)

Unnamed: 0,seq_id,alignment_start,alignment_end,envelope_start,envelope_end,hmm_acc,hmm_name,type,hmm_start,hmm_end,hmm_length,bit_score,e_value,clan
0,A0A024QZ18,69,147,66,147,PF00595,PDZ,Domain,4,82,82,51.3,1.600000e-10,CL0466
1,A0A024QZ33,5,123,4,123,PF09745,DUF2040,Coiled-coil,2,121,121,124.9,2.600000e-33,
2,A0A024QZ42,25,84,22,86,PF13499,EF-hand_7,Domain,4,69,71,41.7,1.800000e-07,CL0220
3,A0A024QZB8,40,436,39,437,PF02487,CLN3,Family,2,398,399,461.1,4.800000e-135,
4,A0A024QZP7,4,287,4,287,PF00069,Pkinase,Domain,1,264,264,258.8,7.400000e-74,CL0016
5,A0A024QZX5,11,380,10,380,PF00079,Serpin,Domain,2,370,370,435.8,2.300000e-127,
6,A0A024R0K5,40,140,38,141,PF07686,V-set,Domain,3,108,109,47.4,2.600000e-09,CL0011
7,A0A024R0K5,239,318,239,318,PF13895,Ig_2,Domain,1,79,79,62.7,4.500000e-14,CL0011
8,A0A024R0K5,418,495,417,496,PF13895,Ig_2,Domain,2,78,79,56.9,3.000000e-12,CL0011
9,A0A024R0K5,596,663,595,674,PF13895,Ig_2,Domain,2,69,79,40.2,4.700000e-07,CL0011


In [3]:
pdz_protein_ids = set(human_pfam_domains.loc[human_pfam_domains['hmm_name'] == 'PDZ', 'seq_id'])
print('There are %d proteins with a PDZ domain.' % len(pdz_protein_ids))

There are 381 proteins with a PDZ domain.


In [4]:
'''
We build a dataframe of all ~20K reviewed human proteins from UniProt, and all the keywords of each of these proteins.
Note that the 'keywords' column is of type set (i.e. each entry in this column is a set of strings).
'''

uniprot_ids = []
keywords = []

# From: http://www.uniprot.org/uniprot/?query=organism%3A%22Homo+sapiens+%5B9606%5D%22+AND+reviewed%3Ayes&sort=score
with gzip.open(os.path.join(DIR, 'uniprot_human_reviewed.xml.gz'), 'rt') as f:
    for record in SeqIO.parse(f, 'uniprot-xml'):
        uniprot_ids.append(record.id)
        keywords.append(set(record.annotations['keywords']))
        
protein_data = pd.DataFrame({'uniprot_id': uniprot_ids, 'keywords': keywords})
display(protein_data)

Unnamed: 0,uniprot_id,keywords
0,P32929,"{Cysteine biosynthesis, 3D-structure, Lyase, A..."
1,O14646,"{Nucleus, Transcription, Mental retardation, P..."
2,Q8IWX8,"{Phosphoprotein, Polymorphism, Reference prote..."
3,Q99653,"{3D-structure, Cell membrane, Phosphoprotein, ..."
4,O94983,"{Alternative splicing, Polymorphism, Reference..."
5,P42695,"{Mitosis, Phosphoprotein, Polymorphism, Refere..."
6,Q8IYT2,"{S-adenosyl-L-methionine, Polymorphism, mRNA c..."
7,Q9NSA3,"{3D-structure, Phosphoprotein, Reference prote..."
8,Q96KP4,"{3D-structure, Carboxypeptidase, Alternative s..."
9,O95476,"{Polymorphism, Transmembrane helix, Reference ..."


In [5]:
all_keywords = set.union(*protein_data['keywords'])
print('There are %d unique keywords.' % len(all_keywords))

protein_data['contains_pdz'] = protein_data['uniprot_id'].isin(pdz_protein_ids)
print('%d of the %d reviewed proteins contain a PDZ domain.' % (protein_data['contains_pdz'].sum(), len(protein_data)))

display(protein_data.head())

There are 723 unique keywords.
129 of the 20412 reviewed proteins contain a PDZ domain.


Unnamed: 0,uniprot_id,keywords,contains_pdz
0,P32929,"{Cysteine biosynthesis, 3D-structure, Lyase, A...",False
1,O14646,"{Nucleus, Transcription, Mental retardation, P...",False
2,Q8IWX8,"{Phosphoprotein, Polymorphism, Reference prote...",False
3,Q99653,"{3D-structure, Cell membrane, Phosphoprotein, ...",False
4,O94983,"{Alternative splicing, Polymorphism, Reference...",False


In [16]:
'''
We iterate over all the unique keywords, and for each one we independently build a 2X2 contingency table. This table counts
how many proteins there are (of the entire ~20K reviewed human proteins) that have or don't have this keyword and have or don't
have a PDZ domain. The null hypothesis is that the rows and columns of this table are independent (i.e. no correlation between
the presence of the keyword and the presence of PDZ domain). We run a Fisher's exact test to calculate p-values. We also
calculate Risk-Ratio (RR) to measure the effect size.
'''

test_keywords = []
test_RRs = []
test_pvals = []

for keyword in all_keywords:
    
    # A mask matching to the protein_data dataframe for the given keyword (i.e. the mask will have a True value only
    # for proteins that have this keyword, False otherwise).
    keyword_mask = protein_data['keywords'].apply(lambda protein_keywords: keyword in protein_keywords)
    
    contingency_table = pd.DataFrame(0, index = ['pdz', 'no_pdz'], columns = ['keyword', 'no_keyword'])
    
    for row_index in range(2):
        
        # A mask over protein_data matching the row criterion (pdz or no_pdz)
        row_mask = protein_data['contains_pdz'] if row_index == 0 else ~protein_data['contains_pdz']
    
        for column_index in range(2):
            # A mask over protein_data matching the column criterion (keyword or no_keyword)
            column_mask = keyword_mask if column_index == 0 else ~keyword_mask
            # (row_mask & column_mask) will give the cell's specific mask (over protein_data). Summing this mask will give the
            # number of proteins that match the overall criterion for this cell (pdz vs. no_pdz AND keyword vs. no_keyword)
            contingency_table.iloc[row_index, column_index] = (row_mask & column_mask).sum()
    
    # The "risk" of having the keyword for a protein containing a PDZ domain
    pdz_risk = contingency_table.loc['pdz', 'keyword'] / contingency_table.loc['pdz', :].sum()
    # The "risk" of having the keyword for a protein NOT containing a PDZ domain
    no_pdz_risk = contingency_table.loc['no_pdz', 'keyword'] / contingency_table.loc['no_pdz', :].sum()
    
    if no_pdz_risk == 0:
        RR = np.nan
    else:
        RR = pdz_risk / no_pdz_risk
    
    _, pval = fisher_exact(contingency_table)
    
    test_keywords.append(keyword)
    test_RRs.append(RR)
    test_pvals.append(pval)
    
keyword_tests = pd.DataFrame({'keyword': test_keywords, 'RR': test_RRs, 'pval': test_pvals})
display(keyword_tests)

Unnamed: 0,keyword,RR,pval
0,Peroxisome biogenesis,0.000000,1.000000
1,Vision,1.379233,0.518638
2,Endonuclease,0.000000,1.000000
3,Antiviral defense,0.000000,1.000000
4,Sensory transduction,0.261618,0.190308
5,Ubl conjugation,1.079109,0.686731
6,Complement alternate pathway,0.000000,1.000000
7,Transit peptide,0.000000,0.052706
8,Target membrane,0.000000,1.000000
9,Paired box,0.000000,1.000000


In [17]:
EPSILON = 1e-5

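# We must account for multiple testing! Bonferroni controls the family-wise error rate, so the 'qval' column
# below holds Bonferroni-adjusted p-values (not FDR q-values).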
keyword_tests['significance'], keyword_tests['qval'], _, _ = multipletests(keyword_tests['pval'], method = 'bonferroni')
significant_keyword_tests = keyword_tests[keyword_tests['significance']]
print('%d of %d keywords are significantly associated (or anti-associated) with the presence of PDZ:' % \
        (len(significant_keyword_tests), len(keyword_tests)))

# We sort the significant keyword tests by max{RR, 1 / (RR + EPSILON)} (we use EPSILON to avoid division by zero)
significant_keyword_tests = significant_keyword_tests.assign(absolute_RR = significant_keyword_tests['RR'].apply(\
        lambda RR: max(RR, 1 / (RR + EPSILON)))).sort_values('absolute_RR', ascending = False).drop('absolute_RR', axis = 1)
display(significant_keyword_tests)

27 of 723 keywords are significantly associated (or anti-associated) with the presence of PDZ:


Unnamed: 0,keyword,RR,pval,significance,qval
411,Receptor,0.0,4.593681e-05,True,0.03321231
408,DNA-binding,0.0,2.159798e-06,True,0.001561534
276,Usher syndrome,52.410853,5.203706e-05,True,0.03762279
639,Tight junction,38.300239,1.9117230000000002e-23,True,1.382175e-20
31,Glycoprotein,0.033376,1.344335e-13,True,9.719542e-11
552,LIM domain,25.775829,2.377872e-11,True,1.719202e-08
182,Synaptosome,25.728964,2.491073e-10,True,1.801045e-07
106,Disulfide bond,0.042052,2.562663e-10,True,1.852805e-07
212,Signal,0.042983,4.186233e-10,True,3.026646e-07
591,Transmembrane,0.059955,4.30399e-14,True,3.111785e-11


Note that some of the significant keywords are associated with PDZ (RR > 1), while others are anti-associated (RR < 1)!
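One way to put enrichment and depletion on the same scale is to rank keywords by the absolute log of the risk ratio, which orders keywords in the same way as max{RR, 1 / RR}. This is an alternative sketch rather than what the cell above does; `effect_magnitude` is a hypothetical helper, the RR values are copied from the table above, and EPSILON plays the same role as before (avoiding log of zero).

```python
import math

EPSILON = 1e-5

def effect_magnitude(RR):
    # |log2(RR)| is symmetric: RR = 4 and RR = 1/4 both give a magnitude of about 2.
    return abs(math.log2(RR + EPSILON))

# RR values copied from the table of significant keywords above.
table_RRs = {
    'Tight junction': 38.300239,
    'Glycoprotein': 0.033376,
    'DNA-binding': 0.0,
    'LIM domain': 25.775829,
}

ranked_keywords = sorted(table_RRs, key=lambda keyword: effect_magnitude(table_RRs[keyword]), reverse=True)
for keyword in ranked_keywords:
    direction = 'enriched' if table_RRs[keyword] > 1 else 'depleted'
    print(keyword, direction, round(effect_magnitude(table_RRs[keyword]), 2))
```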

Note that the statistical test we carried out could be biased in many ways:

1. We rely on UniProt's annotations for keywords, which could be incomplete (lacking keywords that should have been part of a protein's annotation but weren't). Note that keywords and GO annotations are very sensitive to definitions and methodology, which vary between different databases. Different (yet equally justified) keyword definitions could have produced very different results.

2. We rely on Pfam's HMM for defining the PDZ domain. As with UniProt's keywords, there could be false positives and false negatives.

3. The existence of paralogs (duplications within the human proteome) could call into question our assumption that protein annotations are independent (Fisher's exact test, like most standard tests, assumes independent observations).

4. We have looked only at human proteins, which could be a bad representation of the entire space of proteins. However, if we looked at all UniProt proteins, we would be susceptible to even greater biases due to:

    4.1. Overlap between orthologs (homologs across different species) would probably dominate the analysis (much more than homologs within the same species) if we didn't account for them (e.g. by taking only "unique representatives", which are quite challenging to define).

    4.2. The space of species covered in UniProt is biased towards the species that have been studied more (as in all databases that attempt to cover all organisms).

    4.3. Even when a species is represented in UniProt, the quality of annotations (e.g. of the keywords) is not identical across the board. More studied species should have better, more accurate annotations, and the annotations for human proteins are probably the most comprehensive. The same goes for Pfam annotations.

    For all of these reasons, the analysis was carried out only for human proteins. But it means that the results should not be blindly generalized to other species.
    

Notwithstanding these potential biases, we expect the associations between functional keywords and the presence of PDZ domains to reflect genuine functional characteristics of PDZ-containing proteins. Such statistical associations can follow directly from the function of PDZ, or be indirect. For example, the keyword "Postsynaptic cell membrane" is directly related to the role of PDZ in the brain, so it's not surprising to find it very significantly and strongly enriched in PDZ-containing proteins. The depletion of the "DNA-binding" keyword in PDZ-containing proteins, on the other hand, is more indirect, but not surprising: membrane proteins are less likely to find their way into the nucleus and bind DNA.

Enrichment of the following keywords is expected given the known roles of PDZ: Synaptosome, Postsynaptic cell membrane, Cell junction, Synapse, Cytoskeleton, Cell membrane, Phosphoprotein, Membrane.

Depletion of the following keywords is more surprising: Glycoprotein, Signal, Transmembrane.