### Abstract

We identify the histologies associated with tissue samples that carry NF1 somatic mutations in the COSMIC database; there are 21. For those histologies we collect the somatic mutations in all other genes. We take the main predisposing gene for a histology to be the gene with the most mutation-sample pairs associated with it. By this score, 4 histologies are primarily associated with NF1. For MPNST, 3 genes score equally, and neurofibroma has a very high score for NF1. Each of the 4 histologies is associated with multiple NF1 mutation sites, and each NF1 mutation site observed in these 4 histologies is associated with only a single histology. For the NF1 mutation sites associated with each histology, we give the FATHMM score and PubMed ID.

NOTE: All of the samples are taken from people with somatic mutations leading to cancer, so there is selection bias in the samples to that extent.

TODO: Program-based search for variant type, affected pathways and clinical significance from NCBI.

### Required packages

In [1]:
import os, re, time
import pandas as pd
import numpy as np

In [214]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

### Getting the COSMIC database of genome screens

For development purposes we have downloaded the COSMIC database from their website.  We have also uploaded it as a ZIP file to private file syn20716008.  It has license and sharing conditions and is 10GB in size, which makes it difficult to Dockerize.  Let's assume you have downloaded the file and placed it in a folder, and that there is an environment variable NF2_DATA that points to that folder.  This file, uncompressed, takes about an hour to read on a laptop PC with 16GB of RAM and 45GB of paging space.  It takes about 10 minutes to read on a workstation with 32GB of RAM, Intel core i9 processor and solid state disk.

In [3]:
data_dir=os.getenv('NF2_DATA')
df=pd.read_csv(f'{data_dir}/CosmicGenomeScreensMutantExport.tsv', sep='\t')
df=df.loc[:,['Gene name','Mutation CDS', 'ID_sample', 'Primary histology']].dropna()
df.columns=['gene', 'mutation', 'sampleid', 'histology']
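
Because the export is large, one way to reduce the memory needed while loading is to read only the four columns used here, in chunks. The following is a minimal sketch of that idea, not the method used in this notebook: the helper name `load_cosmic_columns`, the chunk size, and the toy rows are all assumptions for illustration, and unlike the cell above it resets the row index.

```python
import io
import pandas as pd

# Column names as used in the loading cell above.
COLUMNS = ['Gene name', 'Mutation CDS', 'ID_sample', 'Primary histology']

def load_cosmic_columns(source, chunksize=1_000_000):
    """Hypothetical helper: read only the needed columns, one chunk at a time."""
    chunks = pd.read_csv(source, sep='\t', usecols=COLUMNS, chunksize=chunksize)
    # Drop incomplete rows per chunk so the unfiltered table never sits in memory.
    parts = [chunk.dropna() for chunk in chunks]
    out = pd.concat(parts, ignore_index=True)[COLUMNS]
    out.columns = ['gene', 'mutation', 'sampleid', 'histology']
    return out

# Tiny in-memory stand-in for the real TSV (made-up rows, illustration only).
toy_tsv = (
    "Gene name\tExtra column\tMutation CDS\tID_sample\tPrimary histology\n"
    "NF1\tx\tc.1A>G\t1\tneurofibroma\n"
    "NF1\tx\t\t2\tglioma\n"          # missing 'Mutation CDS' -> dropped
    "TP53\tx\tc.2C>T\t3\tcarcinoma\n"
)
toy = load_cosmic_columns(io.StringIO(toy_tsv), chunksize=2)
print(toy)
```

With the real file, `source` would be the path built from `data_dir`; whether this is faster on a given machine would need to be measured.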

### We find histologies associated with tissues containing NF1 somatic mutations in the COSMIC database

#### Remove unspecified histologies ('other' and 'NS')

In [4]:
df=df[df.histology != 'other']

In [5]:
df=df[df.histology != 'NS']

#### Make another dataframe for just NF1 mutations

In [6]:
nf1=df[df.gene=='NF1']
dd=nf1.to_dict()

In [135]:
nf1_histologies=set(nf1.histology)

#### There are 21

In [137]:
len(nf1_histologies)

21

### We find all other gene somatic mutations for those histologies

#### Get the list of tissue samples studied for NF1

In [8]:
nf1_samples=set(nf1.sampleid)

#### Restrict the set of all mutations to those concerning NF1-related histologies and samples

In [44]:
def strip_gene(x):
    return x.split('_')[0] if '_' in x else x

In [10]:
df_nf1=df[df.histology.isin(nf1_histologies)]
df_nf1=df_nf1[df_nf1.sampleid.isin(nf1_samples)]
genes=df_nf1.gene.unique()
samples=df_nf1.sampleid.unique()
histologies=df_nf1.histology.unique()
M=df_nf1.values
f=np.vectorize(strip_gene)
M[:,0]=f(M[:,0])
W={histology:{} for gene, mutation, sample, histology in M}
for gene, mutation, sample, histology in M:
    W[histology][gene]={}
for gene, mutation, sample, histology in M:
    W[histology][gene][mutation]=set([])
for gene, mutation, sample, histology in M:
    W[histology][gene][mutation].add(sample)
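
The nested structure built above maps histology to gene to mutation to the set of samples carrying it. As a sketch of the same shape on toy data, the version below uses `collections.defaultdict` in a single pass. The function name `build_index`, the rows, and the transcript-style suffix `NF1_ENST0000` are invented for illustration.

```python
from collections import defaultdict

def build_index(rows):
    """Map histology -> gene -> mutation -> set of sample ids."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for gene, mutation, sample, histology in rows:
        gene = gene.split('_')[0]  # same effect as strip_gene above
        index[histology][gene][mutation].add(sample)
    return index

# Made-up rows in the same (gene, mutation, sample, histology) order as M.
toy_rows = [
    ('NF1', 'c.1A>G', 101, 'neurofibroma'),
    ('NF1_ENST0000', 'c.1A>G', 102, 'neurofibroma'),  # suffix is stripped
    ('NF1', 'c.5del', 101, 'neurofibroma'),
    ('TP53', 'c.2C>T', 103, 'glioma'),
]
toy_W = build_index(toy_rows)
print(toy_W['neurofibroma']['NF1'])
```

One caveat of this design: looking up a missing key in a `defaultdict` silently creates it, so membership checks should use `in` rather than indexing.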

### We assess the main predisposing gene for each histology

#### Our scoring function is the gene with the most mutation-sample pairs associated with the histology

In [77]:
n=3

In [118]:
def score(mutation_samples):
    return sum([len(mutation_samples[x]) for x in mutation_samples])
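
To make the score concrete, here is a small worked example on invented data. It restates the scoring rule under the hypothetical name `score_pairs`; the mutation names and sample ids are made up.

```python
def score_pairs(mutation_samples):
    """Count (mutation, sample) pairs for one gene within one histology."""
    return sum(len(samples) for samples in mutation_samples.values())

# One gene with two mutation sites; sample 101 carries both of them.
toy_gene = {'c.1A>G': {101, 102}, 'c.5del': {101}}
toy_score = score_pairs(toy_gene)  # 2 pairs + 1 pair
print(toy_score)
```

Note that a sample carrying two different mutations in the same gene contributes two pairs, so the score counts mutation-sample pairs rather than distinct samples.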

In [144]:
causative = []
not_causative = []
for histology in W:
    TT=list(reversed(sorted([(score(W[histology][gene]),gene) for gene in W[histology]])))
    total_mutations=sum([x for x,y in TT])
    TT1=[(y,x/total_mutations) for x,y in TT]
    genes=[x for x,y in TT1]
    if 'NF1' in genes[0:n]:
        i=[i for i,x in enumerate(genes[0:20]) if x=='NF1'][0]
        if genes[0]=='NF1' or TT1[0][1]==TT1[i][1]:
            causative.append((histology, [x for x in TT1 if x[1]==TT1[0][1]]))
        else:
            not_causative.append(histology)
    else:
        not_causative.append(histology)
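
The tie handling in the selection above can be illustrated in isolation. The sketch below returns every gene tied for the highest score in one histology, together with its share of all mutation-sample pairs. The name `top_genes` and all counts are invented, and unlike the cell above it does not apply the `n=3` cutoff on the sorted list.

```python
def top_genes(gene_scores):
    """Return (gene, fraction) for every gene tied for the highest score."""
    total = sum(gene_scores.values())
    best = max(gene_scores.values())
    # Compare the raw integer scores, which avoids float equality checks.
    return sorted((gene, s / total) for gene, s in gene_scores.items() if s == best)

# Invented pair counts for one histology: three genes tie at the top.
toy_scores = {'NF1': 2, 'SAMD4A': 2, 'EHBP1': 2, 'GENE_X': 1, 'GENE_Y': 1, 'GENE_Z': 1}
toy_top = top_genes(toy_scores)
print(toy_top)
```

Under this sketch a histology would count NF1 as a main gene when `'NF1'` appears among the returned genes.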

#### 17 histologies do not score NF1 as the main associated gene

In [145]:
pd.DataFrame(not_causative, columns=['Not caused by NF1'])

Unnamed: 0,Not caused by NF1
0,carcinoma
1,lymphoid_neoplasm
2,haematopoietic_neoplasm
3,malignant_melanoma
4,glioma
5,carcinoid-endocrine_tumour
6,angiosarcoma
7,adenoma
8,primitive_neuroectodermal_tumour-medulloblastoma
9,Ewings_sarcoma-peripheral_primitive_neuroectod...


#### We find that 4 histologies are scored as primarily associated with NF1

In [153]:
rows=[]
for histo,L in causative:
    for (gene, frac) in L:  # 'frac' avoids shadowing the score() function defined above
        rows.append((frac, gene, histo))
rows.sort()
rows=list(reversed(rows))

In [155]:
nf1_histologies=[x for x,y in causative]
pd.DataFrame(nf1_histologies, columns=['histology'])

Unnamed: 0,histology
0,pheochromocytoma
1,neuroblastoma
2,malignant_peripheral_nerve_sheath_tumour
3,neurofibroma


In [168]:
n_nf1_histologies=len(nf1_histologies)
n_nf1_histologies

4

### For MPNST, 3 genes are equally scored and Neurofibroma has a very high score for NF1

In [160]:
pd.DataFrame(rows, columns=['score', 'gene', 'histology'])

Unnamed: 0,score,gene,histology
0,0.909091,NF1,neurofibroma
1,0.222222,SAMD4A,malignant_peripheral_nerve_sheath_tumour
2,0.222222,NF1,malignant_peripheral_nerve_sheath_tumour
3,0.222222,EHBP1,malignant_peripheral_nerve_sheath_tumour
4,0.152838,NF1,pheochromocytoma
5,0.046005,NF1,neuroblastoma


### Multiple NF1 mutation sites are associated with each histology

In [245]:
nf1_mutations={histology:[x for x in W[histology]['NF1']] for histology in nf1_histologies}

In [246]:
pd.DataFrame([(x, len(nf1_mutations[x]), ','.join(sorted(nf1_mutations[x]))) for x in nf1_mutations],
               columns=['histology', 'n_mutations', 'mutations'])

Unnamed: 0,histology,n_mutations,mutations
0,pheochromocytoma,22,"c.1398del,c.1400C>T,c.1448A>G,c.1466A>G,c.1658..."
1,neuroblastoma,17,"c.*2307C>A,c.1039C>T,c.1063-58G>A,c.1977G>T,c...."
2,malignant_peripheral_nerve_sheath_tumour,2,"c.6642-17G>A,c.6705-17G>A"
3,neurofibroma,10,"c.1324_1331del,c.3043del,c.5565_5567del,c.5628..."


### NF1 mutation sites are each observed for only a single histology

In [172]:
intersections=[]
for i in range(n_nf1_histologies):
    hist_i=nf1_histologies[i]
    W_i=nf1_mutations[hist_i]
    for j in range(i+1,n_nf1_histologies):
        hist_j=nf1_histologies[j]
        W_j=nf1_mutations[hist_j]
        intersection=[x for x in W_i if x in W_j]
        intersections.append((hist_i,len(W_i), hist_j, len(W_j), len(intersection)))

In [177]:
pd.DataFrame(intersections, columns=['hist A', '# Mutations A', 'hist B', '# MutationsB', '# Intersections'])

Unnamed: 0,hist A,# Mutations A,hist B,# MutationsB,# Intersections
0,pheochromocytoma,22,neuroblastoma,17,0
1,pheochromocytoma,22,malignant_peripheral_nerve_sheath_tumour,2,0
2,pheochromocytoma,22,neurofibroma,10,0
3,neuroblastoma,17,malignant_peripheral_nerve_sheath_tumour,2,0
4,neuroblastoma,17,neurofibroma,10,0
5,malignant_peripheral_nerve_sheath_tumour,2,neurofibroma,10,0
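
The pairwise comparison above can also be written with `itertools.combinations` and set intersection. This is a sketch on invented data, assuming a mapping from histology to a list of mutation strings shaped like `nf1_mutations`; the name `pairwise_overlap` and the toy entries are hypothetical.

```python
from itertools import combinations

def pairwise_overlap(mutations_by_histology):
    """For each pair of histologies, count the mutation sites they share."""
    rows = []
    for (a, muts_a), (b, muts_b) in combinations(mutations_by_histology.items(), 2):
        shared = set(muts_a) & set(muts_b)
        rows.append((a, len(muts_a), b, len(muts_b), len(shared)))
    return rows

# Made-up mutation lists; 'c.5del' is deliberately shared by A and C.
toy_mutations = {
    'histology_A': ['c.1A>G', 'c.5del'],
    'histology_B': ['c.2C>T'],
    'histology_C': ['c.5del', 'c.9G>A'],
}
toy_overlap = pairwise_overlap(toy_mutations)
for row in toy_overlap:
    print(row)
```

A result in which every pair shares zero sites, as in the table above, corresponds to the last element of every row being 0.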


## Information from COSMIC and NCBI on each NF1 mutation associated with a histology

For the NF1 mutation sites associated with each histology, we present the annotations available in the COSMIC export: amino-acid change, mutation description (variant type), zygosity, strand, SNP flag, FATHMM prediction and score, and PubMed ID.

In [231]:
cosmic=pd.read_csv(f'{data_dir}/CosmicGenomeScreensMutantExport.tsv', sep='\t')

In [253]:
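def nf1_hist_mutation_info(hist):
    # NF1 mutation sites recorded for this histology
    L=nf1_mutations[hist]
    df = cosmic[cosmic['Primary histology']==hist]
    # Keep NF1 rows only (including names with a suffix after '_'), so that
    # identical CDS strings in other genes are not picked up
    df = df[df['Gene name'].str.split('_').str[0]=='NF1']
    df = df[df['Mutation CDS'].isin(L)]
    df = df[['Mutation CDS',
       'Mutation AA', 'Mutation Description', 'Mutation zygosity', 
       'Mutation strand', 'SNP',
       'FATHMM prediction', 'FATHMM score',
       'Pubmed_PMID']]
    return df.drop_duplicates()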

### malignant_peripheral_nerve_sheath_tumour

In [254]:
nf1_hist_mutation_info('malignant_peripheral_nerve_sheath_tumour')

Unnamed: 0,Mutation CDS,Mutation AA,Mutation Description,Mutation zygosity,Mutation strand,SNP,FATHMM prediction,FATHMM score,Pubmed_PMID
17392781,c.6705-17G>A,p.?,Unknown,,+,n,NEUTRAL,0.43648,26432421.0
28568308,c.6642-17G>A,p.?,Unknown,,+,n,NEUTRAL,0.43648,26432421.0


### pheochromocytoma

In [255]:
nf1_hist_mutation_info('pheochromocytoma')

Unnamed: 0,Mutation CDS,Mutation AA,Mutation Description,Mutation zygosity,Mutation strand,SNP,FATHMM prediction,FATHMM score,Pubmed_PMID
51506,c.1448A>G,p.D483G,Substitution - Missense,het,+,n,PATHOGENIC,0.98235,23781326.0
51508,c.2474C>G,p.S825C,Substitution - Missense,het,+,n,PATHOGENIC,0.98167,23781326.0
607775,c.1398del,p.T467Hfs*6,Deletion - Frameshift,het,+,n,,,25545346.0
724772,c.663G>A,p.W221*,Substitution - Nonsense,het,+,n,PATHOGENIC,0.99127,25545346.0
902327,c.1400C>T,p.T467I,Substitution - Missense,het,+,n,PATHOGENIC,0.97282,23781326.0
936893,c.1658A>G,p.H553R,Substitution - Missense,het,+,n,PATHOGENIC,0.99034,23781326.0
2174733,c.2422del,p.L808Ffs*13,Deletion - Frameshift,het,+,n,,,23781326.0
2174745,c.6705-1G>T,p.?,Unknown,het,+,n,PATHOGENIC,0.99689,25545346.0
2908333,c.6522_6523del,p.E2174Dfs*46,Deletion - Frameshift,,+,n,,,25545346.0
4121999,c.1466A>G,p.Y489C,Substitution - Missense,het,+,n,PATHOGENIC,0.96016,23781326.0


### neuroblastoma

In [256]:
nf1_hist_mutation_info('neuroblastoma')

Unnamed: 0,Mutation CDS,Mutation AA,Mutation Description,Mutation zygosity,Mutation strand,SNP,FATHMM prediction,FATHMM score,Pubmed_PMID
4147230,c.1063-58G>A,p.?,Unknown,,+,n,NEUTRAL,0.08494,
4475582,c.6691G>T,p.E2231*,Substitution - Nonsense,,+,n,PATHOGENIC,0.99796,23334666.0
5184719,c.2001+39G>T,p.?,Unknown,,+,n,NEUTRAL,0.15611,
6271437,c.*2307C>A,p.?,Unknown,,+,n,NEUTRAL,0.0368,
8746460,c.1039C>T,p.Q347*,Substitution - Nonsense,,+,n,PATHOGENIC,0.97489,
10515955,c.1977G>T,p.R659=,Substitution - coding silent,,+,n,PATHOGENIC,0.84177,
11578764,c.910C>G,p.R304G,Substitution - Missense,,+,n,PATHOGENIC,0.8791,
11907901,c.4836-20659G>T,p.?,Unknown,,+,n,NEUTRAL,0.11841,23334666.0
11932998,c.6628G>T,p.E2210*,Substitution - Nonsense,,+,n,PATHOGENIC,0.99796,23334666.0
12614951,c.7501G>T,p.E2501*,Substitution - Nonsense,,+,n,PATHOGENIC,0.99166,


### neurofibroma

In [257]:
nf1_hist_mutation_info('neurofibroma')
