# Term Enrichment

Calculate the enrichment of each GO term within each IsoClass.

## Algorithm
First, calculate the number of PDBs in the base set (= N). Then:
- For each IsoClass:
    - Find the number of PDBs in the IsoClass (= n)
    - For each GO term:
        - Find the number of times the term occurs in the IsoClass (= k)
        - Find the number of times the term occurs in the base set (= K)
        - Get the p-value from the CDF of the hypergeometric distribution with parameters N, n and K, evaluated at k
        - Store the p-value
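
The tail probabilities behind the last steps can be written out with the standard library alone. The sketch below is a minimal illustration, not the notebook's implementation: the helper names `hypergeom_pmf`, `lower_tail` and `upper_tail` and the toy numbers are assumptions. `lower_tail` corresponds to the CDF, P(X ≤ k). Over-representation tests often report the upper tail, P(X ≥ k), instead, so both are shown.

```python
from math import comb

def hypergeom_pmf(x, N, n, K):
    """P(X = x): x of the n sampled PDBs carry the term, with K carriers among N."""
    if x < max(0, n + K - N) or x > min(n, K):
        return 0.0
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

def lower_tail(k, N, n, K):
    """P(X <= k), i.e. the hypergeometric CDF."""
    return sum(hypergeom_pmf(x, N, n, K) for x in range(0, k + 1))

def upper_tail(k, N, n, K):
    """P(X >= k), the tail usually associated with over-representation."""
    return sum(hypergeom_pmf(x, N, n, K) for x in range(k, min(n, K) + 1))

# Toy numbers (hypothetical): 20 PDBs in the base set, 5 of them annotated
# with the term, and an IsoClass of 4 PDBs in which 3 carry the term.
N, n, K, k = 20, 4, 5, 3
print(lower_tail(k, N, n, K))
print(upper_tail(k, N, n, K))
```

The two tails overlap at X = k, so they sum to one plus P(X = k) rather than to exactly one.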

In [17]:
# Read the PDB references corresponding to the isoClasses, and from them construct the base set.
with open("./isomorphicProteins.dat") as flines:
    isoClasses = [line.strip().split(" ") for line in flines]

allProteins = set(x for sublist in isoClasses for x in sublist)  # The "base data"
N = len(allProteins)

# Strip out the isoClasses with only one protein.
isoClasses = [x for x in isoClasses if len(x) != 1]


In [5]:
# Load the PDB-to-GO mapping into pandas, then keep only the rows whose PDB
# is in allProteins (needed for computing K later).
import pandas as pd

df = pd.read_csv("pdb_chain_go.tsv", sep="\t", header=0, usecols=["PDB", "GO_ID"])
df = df[df['PDB'].isin(allProteins)]

In [6]:
df.head()

|      | PDB  | GO_ID      |
|------|------|------------|
| 2870 | 16pk | GO:0004618 |
| 2871 | 16pk | GO:0006096 |
| 2872 | 16pk | GO:0004618 |
| 2873 | 16pk | GO:0006096 |
| 2874 | 16pk | GO:0004618 |


In [10]:
# Count how often each GO term occurs in the base set (gives K).
GOcounts = df['GO_ID'].value_counts()
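
The `df.head()` output shows the same (PDB, GO_ID) pair on several rows, presumably one per chain, so row counts can exceed the number of PDBs. If K and k are meant to count PDBs (the same unit as N and n), each pair would need to be counted once. The following is a minimal standard-library sketch of that idea, under that assumption. The `rows` list is made-up data and `count_terms_per_pdb` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def count_terms_per_pdb(rows):
    """Count, for each GO term, the number of distinct PDBs annotated with it."""
    unique_pairs = set(rows)  # collapse repeated (PDB, GO_ID) rows
    return Counter(go_id for _pdb, go_id in unique_pairs)

# Made-up rows in the same (PDB, GO_ID) shape as the DataFrame.
rows = [
    ("16pk", "GO:0004618"),
    ("16pk", "GO:0006096"),
    ("16pk", "GO:0004618"),  # repeated pair, counted once
    ("1abc", "GO:0006096"),  # hypothetical second PDB
]
counts = count_terms_per_pdb(rows)
print(counts["GO:0004618"], counts["GO:0006096"])
```

With pandas, the corresponding step would be calling `df.drop_duplicates()` before `value_counts()`.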

In [47]:
from scipy.stats import hypergeom
enrichmentJson = []
for i, isoClass in enumerate(isoClasses):
    n = len(isoClass)
    # Count the occurrences of each GO term that appears at least once in this isoClass.
    GOtermsPresent = df[df['PDB'].isin(isoClass)]['GO_ID'].value_counts()
    # Get the enrichment for each term
    GOterms = []
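    for GOterm, k in GOtermsPresent.items():  # Series.iteritems() was removed in pandas 2.0
        K = GOcounts[GOterm]
        # hypergeom.cdf(k, N, n, K) is the lower tail P(X <= k) for a population of size N.
        p = hypergeom.cdf(k, N, n, K)
        GOterms.append({"GOlabel": str(GOterm), "k": str(k), "N": str(N), "n": str(n), "K": str(K), "p": str(p)})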

    enrichmentJson.append({"isoClass": isoClass, "GOterms": GOterms})

import json
with open("GOterms.json", mode='w') as flines:
    json.dump(enrichmentJson, flines, indent=2)
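
Because every numeric field is written as a string, a consumer of `GOterms.json` has to convert values back before sorting or thresholding. A minimal sketch, assuming the structure produced above: the sample record, its values and the helper name `terms_sorted_by_p` are invented for illustration.

```python
import json

def terms_sorted_by_p(record):
    """Return (GOlabel, p) pairs for one isoClass record, smallest p first."""
    pairs = [(term["GOlabel"], float(term["p"])) for term in record["GOterms"]]
    return sorted(pairs, key=lambda pair: pair[1])

# Invented record with the same keys as the entries written to GOterms.json.
sample = json.loads("""
[{"isoClass": ["16pk", "1abc"],
  "GOterms": [
    {"GOlabel": "GO:0004618", "k": "1", "N": "20", "n": "2", "K": "5", "p": "0.9"},
    {"GOlabel": "GO:0006096", "k": "2", "N": "20", "n": "2", "K": "3", "p": "0.25"}
  ]}]
""")
for label, p in terms_sorted_by_p(sample[0]):
    print(label, p)
```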
        