# Proteasome proteins' expression in tumor vs. normal samples

This notebook uses permutation testing to investigate the abundance levels of proteins making up the proteasome in tumor vs. normal samples.

Import needed packages.

In [1]:
import cptac
import pcanalyzer as pc
import seaborn as sns
import matplotlib.pyplot as plt

Load the datasets.

In [2]:
datasets = []

#datasets.append(cptac.Brca()) # Brca has no normal samples
datasets.append(cptac.Ccrcc())
#datasets.append(cptac.Colon())
#datasets.append(cptac.Endometrial())

#datasets.append(cptac.Gbm())
#datasets.append(cptac.Hnscc())

#datasets.append(cptac.Lscc())
#datasets.append(cptac.Luad())

#datasets.append(cptac.Ovarian())

Get a list of the proteins in the proteasome. We have data from both CORUM and HGNC on which proteins belong to particular protein complexes. The HGNC data is more comprehensive: it contains proteins that aren't included in the CORUM data. The CORUM data also splits complexes into more specific subgroups than HGNC does, so the HGNC data is easier to query when we just want all proteins associated with a particular structure.

In [3]:
hgnc_lists = pc.get_hgnc_protein_lists()
proteasome_proteins = sorted(set(hgnc_lists["Proteasome"]))
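
To illustrate why a subgroup-level source takes more work to query, here is a minimal sketch of merging several subcomplex member lists into one deduplicated, sorted list. The `corum_like` dictionary, its keys, and its groupings are hypothetical placeholders made up for this example; they are not real CORUM records and not part of `pcanalyzer`.

```python
# Hypothetical subgroup-style mapping: subcomplex name -> member gene symbols.
# Keys and groupings are illustrative placeholders, not real CORUM records.
corum_like = {
    "subcomplex A": ["PSMA1", "PSMA2", "PSMB1"],
    "subcomplex B": ["PSMC1", "PSMC2", "PSMA1"],
}


def merge_subcomplexes(groups):
    """Union the member lists of several subcomplexes into one sorted list."""
    merged = set()
    for members in groups.values():
        merged.update(members)  # duplicates across subcomplexes collapse here
    return sorted(merged)


all_members = merge_subcomplexes(corum_like)
print(all_members)  # ['PSMA1', 'PSMA2', 'PSMB1', 'PSMC1', 'PSMC2']
```

With a source that already groups every member under one key, this merge step reduces to a single dictionary lookup.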

Test which proteins differ significantly in mean expression between tumor and normal samples.
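
For readers unfamiliar with permutation testing, here is a minimal standard-library sketch of a two-sided permutation test on the difference in group means for a single protein. It is not the `pcanalyzer` implementation, whose internals are not shown in this notebook; the function name `permutation_test_mean_diff`, the `seed` parameter, and the add-one correction are assumptions made for this example.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test_mean_diff(tumor, normal, num_permutations=10000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Illustrative sketch only; not the pcanalyzer implementation.
    """
    rng = random.Random(seed)
    pooled = list(tumor) + list(normal)
    n_tumor = len(tumor)
    observed = abs(_mean(tumor) - _mean(normal))

    extreme = 0
    for _ in range(num_permutations):
        # Shuffle the group labels by shuffling the pooled values in place
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_tumor]) - _mean(pooled[n_tumor:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction (an assumption of this sketch): the result is never 0
    return (extreme + 1) / (num_permutations + 1)


# Two clearly separated groups: no shuffle should match the observed difference
p_separated = permutation_test_mean_diff(
    list(range(100, 120)), list(range(20)), num_permutations=999)
print(p_separated)  # smallest value this sketch can return: 1/1000

# Identical groups: every shuffle is at least as extreme as a difference of 0
p_identical = permutation_test_mean_diff([1, 2, 3], [1, 2, 3], num_permutations=99)
print(p_identical)  # 1.0
```

The add-one correction counts the observed labeling as one of the permutations, so the smallest possible result is `1 / (num_permutations + 1)` rather than zero.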

In [4]:
# Run the tests
num_permutations = 10000
results, num_tests = pc.perm_test_omics_pancancer(
    datasets=datasets,
    id_list=proteasome_proteins,
    data_type="proteomics",
    num_permutations=num_permutations)

# Bonferroni-correct alpha across all tests, and calculate the smallest
# nonzero P value that this number of permutations can resolve
alpha = 0.05 / num_tests
min_P = 1 / num_permutations

# Split each dataset's results into those at or below alpha and those above it, then print them
print(f"Num tests: {num_tests}\nAlpha: {alpha}\nNum permutations: {num_permutations}\nMin P: {min_P}\n")
for dataset, ds_results in results.items():
    sig = []
    not_sig = []
    for protein_result in ds_results:
        if protein_result[1] <= alpha:
            sig.append(protein_result)
        else:
            not_sig.append(protein_result)
            
    print(f"{dataset}\n\tP <= alpha:")
    for sig_res in sig:
        print(f"\t\t{sig_res}")
    print("\n\tP > alpha:")
    for not_sig_res in not_sig:
        print(f"\t\t{not_sig_res}")

Num tests: 44
Alpha: 0.0011363636363636365
Num permutations: 10000
Min P: 0.0001

ccrcc
	P <= alpha:
		(('PSMA1', 'NP_002777.1'), 0.0)
		(('PSMA2', 'NP_002778.1'), 0.0)
		(('PSMA3', 'NP_002779.1'), 0.0)
		(('PSMA4', 'NP_001096137.1'), 0.0)
		(('PSMA5', 'NP_002781.2'), 0.0)
		(('PSMA6', 'NP_002782.1'), 0.0)
		(('PSMA7', 'NP_002783.1'), 0.0)
		(('PSMB1', 'NP_002784.1'), 0.0)
		(('PSMB10', 'NP_002792.1'), 0.0)
		(('PSMB2', 'NP_002785.1'), 0.0)
		(('PSMB3', 'NP_002786.2'), 0.0001)
		(('PSMB4', 'NP_002787.2'), 0.0)
		(('PSMB5', 'NP_002788.1'), 0.0)
		(('PSMB6', 'NP_002789.1'), 0.0)
		(('PSMB7', 'NP_002790.1'), 0.0)
		(('PSMB8', 'NP_683720.2'), 0.0)
		(('PSMB9', 'NP_002791.1'), 0.0)
		(('PSMC1', 'NP_002793.2'), 0.0)
		(('PSMC2', 'NP_002794.1'), 0.0)
		(('PSMC3', 'NP_002795.2'), 0.0)
		(('PSMC4', 'NP_006494.1'), 0.0)
		(('PSMC5', 'NP_002796.4'), 0.0)
		(('PSMC6', 'NP_002797.3'), 0.0)
		(('PSMD1', 'NP_001177966.1'), 0.0)
		(('PSMD10', 'NP_002805.1'), 0.0)
		(('PSMD12', 'NP_002807.1'), 0.0)
		(
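
The threshold arithmetic above can be checked with a short sketch. It uses the values printed in the output (44 tests, 10000 permutations) and assumes a family-wise alpha of 0.05 with Bonferroni correction, as in the cell. The helper names `bonferroni_alpha`, `min_permutations_needed`, and `format_p` are hypothetical, introduced only for this example. The sketch also assumes that a reported P of 0.0 means no permutation was as extreme as the observed difference, so it is displayed as being below the resolvable minimum rather than as exactly zero.

```python
import math


def bonferroni_alpha(family_alpha, num_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return family_alpha / num_tests


def min_permutations_needed(family_alpha, num_tests):
    """Fewest permutations for which 1 / num_permutations <= corrected alpha."""
    return math.ceil(num_tests / family_alpha)


def format_p(p, num_permutations):
    """Show P values below the resolvable minimum as '< min_P' (assumed reading of 0.0)."""
    min_p = 1 / num_permutations
    return f"< {min_p}" if p < min_p else str(p)


alpha = bonferroni_alpha(0.05, 44)
needed = min_permutations_needed(0.05, 44)
print(alpha)
print(needed)  # 880
print(format_p(0.0, 10000))     # < 0.0001
print(format_p(0.0001, 10000))  # 0.0001
```

Because 10000 permutations is well above the 880 needed, the smallest resolvable P value (0.0001) sits below the corrected alpha, so individual tests can reach significance at this threshold.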

In [5]:
# Next: Write a function to do boxplots, based off Amanda's, and then plot the results for proteins
# with a significant difference. Then, test it on the supercomputer for all proteins with small
# numbers of permutations; then run with large numbers of permutations.