# Explore some data


# GENIE Datasets

### Synapse Links

* GENIE 13.3 consortium https://www.synapse.org/#!Synapse:syn36709873
  * data_clinical_patient https://www.synapse.org/#!Synapse:syn36710136
  * data_clinical_sample https://www.synapse.org/#!Synapse:syn36710137
  * data_mutations_extended https://www.synapse.org/#!Synapse:syn36710142
  * data_CNA https://www.synapse.org/#!Synapse:syn36710134
  * data_cna_hg19_seg https://www.synapse.org/#!Synapse:syn36710143
* GENIE 12.0 public https://www.synapse.org/#!Synapse:syn32309524
  * data_clinical_patient https://www.synapse.org/#!Synapse:syn32689054
  * data_clinical_sample https://www.synapse.org/#!Synapse:syn32689057
  * data_mutations_extended https://www.synapse.org/#!Synapse:syn32689317
  * data_CNA https://www.synapse.org/#!Synapse:syn32689019
  * data_cna_hg19_seg https://www.synapse.org/#!Synapse:syn32689379
  
### Data Format Explanations

* https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/
* https://docs.cbioportal.org/file-formats/#mutation-data
* https://docs.cbioportal.org/file-formats/#discrete-copy-number-data
* https://docs.cbioportal.org/file-formats/#segmented-data
  


# NF Subtypes

https://hack4nf.slack.com/archives/C046RR7JFUL/p1665859968638679

```
Hi Lars, there is a predefined taxonomy of NF1 (e.g. MPNST, ANNUB, pNF, cNF, JMML, low grade glioma,), NF2 (e.g. schwannoma, meningioma), Schwannomatosis (e.g. schwannoma) tumor types.  Some tumors are connected too - eg pNF is thought to progress to ANNUBP which is thought to progress to MPNST, or there are subtypes of schwannoma - plexiform, melanotic, etc.
```

[Neurofibromatosis type 1-associated tumours: Their somatic mutational spectrum and pathogenesis](https://link.springer.com/article/10.1186/1479-7364-5-6-623)

### NF1

Acronyms / subtypes

* CNF: cutaneous and subcutaneous neurofibromas
* PNF: plexiform neurofibromas
* MPNST: malignant peripheral nerve sheath tumors

* ANNUBP: atypical neurofibromatous neoplasms of uncertain biologic potential

* JMML: juvenile myelomonocytic leukemias
* GIST: gastrointestinal stromal tumors
* OPG: optic pathway gliomas
* SC: Schwann cell

Oncotree codes

* NST: Nerve Sheath Tumor
  * MPNST: Malignant Peripheral Nerve Sheath Tumor
  * NFIB: Neurofibroma
  * SCHW: Schwannoma
    * CSCHW: Cellular Schwannoma
    * MSCHW: Melanotic Schwannoma
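
The hierarchy above can be written as a small parent-to-children mapping, which makes it easy to collect a code together with everything beneath it before filtering on `ONCOTREE_CODE`. This is a minimal sketch: `ONCOTREE_NST` and `descendants` are hypothetical names introduced here, not part of `hack4nf`, and the mapping only contains the codes listed above.

```python
# Hypothetical encoding of the Oncotree branch listed above (parent -> children).
ONCOTREE_NST = {
    "NST": ["MPNST", "NFIB", "SCHW"],
    "SCHW": ["CSCHW", "MSCHW"],
}


def descendants(code, tree=ONCOTREE_NST):
    """Return `code` followed by every code beneath it, depth first."""
    codes = [code]
    for child in tree.get(code, []):
        codes.extend(descendants(child, tree))
    return codes


nst_codes = descendants("NST")
print(nst_codes)
```

A list like `nst_codes` could then be passed to `Series.isin` to select all nerve sheath tumor samples at once rather than one code at a time.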
 


## Inspiration

[Topological Methods for Visualization and Analysis of High Dimensional Single-Cell RNA Sequencing Data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417818/)

[Genotype-phenotype correlation in neurofibromatosis type-1: NF1 whole gene deletions lead to high tumor-burden and increased tumor-growth](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8099117/)

[Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival](https://www.pnas.org/doi/10.1073/pnas.1102826108)

https://github.com/scikit-tda/kepler-mapper/tree/master/examples

https://kepler-mapper.scikit-tda.org/en/latest/generated/gallery/plot_breast_cancer.html

[Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery](https://www.frontiersin.org/articles/10.3389/fgene.2018.00682/full)

https://github.com/zeochoy/tcga-embedding/blob/master/train.py

In [None]:
from collections import Counter
import itertools
import json
import math
from pathlib import Path
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from scipy import sparse
from scipy.sparse import linalg
import seaborn as sns
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

In [None]:
from hack4nf import genie
from hack4nf.synapse import FILE_NAME_TO_PATH

In [None]:
#from IPython.display import display, HTML
#display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
pd.set_option('display.max_columns', 200)

In [None]:
#genie_dataset_version = "genie-12.0-public"
genie_dataset_version = "genie-13.3-consortium"

# Synapse File Paths

If you are not using the Python client to sync Synapse data, replace these file paths with the corresponding paths on your local system.
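
If the files were downloaded by hand, a dictionary with the same keys can stand in for the Synapse-derived paths. The sketch below is only an illustration: the `data_dir` location and every file name are placeholders (assumptions), so substitute whatever your download actually contains.

```python
from pathlib import Path

# Placeholder directory: point this at wherever you saved the GENIE files.
data_dir = Path("path/to/genie-13.3-consortium")

# Placeholder file names: adjust them to match your local download.
local_file_names = {
    "data_clinical_patient": "data_clinical_patient.txt",
    "data_clinical_sample": "data_clinical_sample.txt",
    "data_mutations_extended": "data_mutations_extended.txt",
    "data_CNA": "data_CNA.txt",
    "data_cna_hg19_seg": "data_cna_hg19.seg",
}

local_file_paths = {
    name: data_dir / file_name for name, file_name in local_file_names.items()
}

# List anything that is not where we expect it before trying to read it.
missing = [name for name, path in local_file_paths.items() if not path.exists()]
print("missing files:", missing)
```

A dictionary built this way has the same keys as `syn_file_paths`, so it can be used in its place in the cells that follow.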

In [None]:
FILE_NAME_TO_PATH

In [None]:
syn_file_paths = {
    'data_clinical_patient': FILE_NAME_TO_PATH[genie_dataset_version]['data_clinical_patient'],
    'data_clinical_sample': FILE_NAME_TO_PATH[genie_dataset_version]['data_clinical_sample'],
    'data_mutations_extended': FILE_NAME_TO_PATH[genie_dataset_version]['data_mutations_extended'],
    'data_CNA': FILE_NAME_TO_PATH[genie_dataset_version]['data_CNA'],
    'data_cna_hg19_seg': FILE_NAME_TO_PATH[genie_dataset_version]['data_cna_hg19'],
}
syn_file_paths

# GENIE - Data Clinical Patient

In [None]:
df_dcp = genie.read_clinical_patient(syn_file_paths['data_clinical_patient'])

In [None]:
df_dcp

# GENIE - Data Clinical Sample

In [None]:
df_dcs = genie.read_clinical_sample(syn_file_paths['data_clinical_sample'])

In [None]:
df_dcs

### Samples per Patient

In [None]:
df_plt = (
    df_dcs.groupby('PATIENT_ID')
    .size()
    .sort_values()
    .to_frame("samples per patient")
    .reset_index()
)
df_plt

In [None]:
sns.histplot(df_plt, x='samples per patient', discrete=True, log_scale=(False, False))

# GENIE - Data Mutations Extended

In [None]:
df_dme = genie.read_mutations_extended(syn_file_paths['data_mutations_extended'])

In [None]:
df_dme

In [None]:
df_dme['HGVSp'].nunique()

### Variant Classification Distribution

In [None]:
df_plt = (
    df_dme.groupby('Variant_Classification')
    .size()
    .sort_values()
    .to_frame("count")
    .reset_index()
)
df_plt

In [None]:
g = sns.barplot(data=df_plt, x='count', y='Variant_Classification', color='grey')
#g.set_xscale("log")
g

### Variants per Sample

In [None]:
df_plt = (
    df_dme.groupby('Tumor_Sample_Barcode')
    .size()
    .sort_values()
    .to_frame("variants per sample")
    .reset_index()
)
df_plt

In [None]:
sns.histplot(
    df_plt, x='variants per sample', 
    log_scale=(True, True)
)

In [None]:
sns.histplot(
    df_plt[df_plt['variants per sample']<100], 
    x='variants per sample', 
    log_scale=(False, True)
)

# GENIE - Merge Patient/Sample/Mutations

In [None]:
df_var = pd.merge(
    df_dme, 
    df_dcs, 
    left_on="Tumor_Sample_Barcode", 
    right_on="SAMPLE_ID",
)
df_var = pd.merge(
    df_var, 
    df_dcp,
    on="PATIENT_ID",
)

In [None]:
df_var

### Distribution of NF-related quantities

In [None]:
df_var.shape

In [None]:
for gene in genie.NF_HUGO_SYMBOLS:
    print("Gene: ", gene)
    
    df = df_var[df_var['Hugo_Symbol'].isin([gene])]
    print("  num_variants: ", df.shape[0])
    print("  num_samples:  ", df['SAMPLE_ID'].nunique())
    print("  num_patients: ", df['PATIENT_ID'].nunique())

In [None]:
for onco in genie.NF_ONCOTREE_CODES:
    print("Oncotree code: ", onco)
    
    df = df_var[df_var['ONCOTREE_CODE'].isin([onco])]
    print("  num_variants: ", df.shape[0])
    print("  num_samples:  ", df['SAMPLE_ID'].nunique())
    print("  num_patients: ", df['PATIENT_ID'].nunique())

# GENIE - Data CNA (Discrete Copy Number Alteration Data)

https://docs.cbioportal.org/file-formats/#discrete-copy-number-data

For each gene-sample combination, a copy number level is specified:

* "-2" is a deep loss, possibly a homozygous deletion
* "-1" is a single-copy loss (heterozygous deletion)
* "0" is diploid
* "1" indicates a low-level gain
* "2" is a high-level amplification.
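
To make the coding concrete, the sketch below builds a tiny made-up sample-by-gene matrix and tallies how often each level occurs. The sample IDs and values are invented for illustration (they are not GENIE data), and `CNA_LEVEL_LABELS` is a hypothetical helper that simply restates the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup restating the copy number levels listed above.
CNA_LEVEL_LABELS = {
    -2: "deep loss",
    -1: "single-copy loss",
    0: "diploid",
    1: "low-level gain",
    2: "high-level amplification",
}

# Made-up matrix: rows are samples, columns are genes, NaN means no value.
toy_cna = pd.DataFrame(
    {"NF1": [-2, 0, np.nan], "NF2": [0, -1, 0], "SMARCB1": [2, 0, 1]},
    index=["sample-A", "sample-B", "sample-C"],
)

# One value per (sample, gene) pair; NaN is not a key, so it falls back to "missing".
values = toy_cna.to_numpy().ravel().tolist()
labels = [CNA_LEVEL_LABELS.get(value, "missing") for value in values]

level_counts = pd.Series(labels).value_counts()
print(level_counts)
```

Even in this toy matrix "diploid" is the most common label, which is the same sparsity pattern the cells below look for in the real data.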

In [None]:
df_cna = genie.read_cna(syn_file_paths['data_CNA'])

In [None]:
df_cna

### Sparsity

In [None]:
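# Flatten to one value per (sample, gene) pair; NaNs become the string 'MISSING'
all_cna_values = df_cna.fillna('MISSING').values.flatten()
non_null_cna_values = all_cna_values[all_cna_values!='MISSING']
non_zero_cna_values = non_null_cna_values[non_null_cna_values!=0]
# Distinct non-zero copy number levels present in the data
set(non_zero_cna_values)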

In [None]:
cna_counts = Counter(all_cna_values)
cna_counts.most_common()

In [None]:
df_cna_counts = pd.DataFrame(cna_counts.items(), columns=['value', 'count'])
df_cna_counts['frac'] = df_cna_counts['count'] / df_cna_counts['count'].sum()
df_cna_counts['perc'] = df_cna_counts['frac'] * 100
df_cna_counts['value_str'] = df_cna_counts['value'].astype(str)
df_cna_counts

In [None]:
df_cna_counts[~df_cna_counts['value_str'].isin(['MISSING', '0.0'])].sum()

In [None]:
sns.barplot(data=df_cna_counts, x='perc', y='value_str', color='grey')

### Distribution of non-null / non-zero values

In [None]:
_ = plt.hist(non_zero_cna_values, bins=np.linspace(-2.25, 2.25, 10))

In [None]:
df_cna['NF1'].hist()

### Gene and Sample L2 Norms

In [None]:
cna_gene_l2 = genie.get_cna_norms(df_cna, axis=0).sort_values()
cna_samp_l2 = genie.get_cna_norms(df_cna, axis=1).sort_values()

In [None]:
cna_gene_l2.tail(30)

In [None]:
cna_gene_l2["NF1"], cna_gene_l2["NF2"], cna_gene_l2["SMARCB1"], cna_gene_l2["LZTR1"]

In [None]:
np.log10(cna_gene_l2["NF1"])

### Distribution of Gene Vector Norms

The L2 norm (length) of a gene vector grows with the number of samples whose copy number deviates from 0, and with the size of those deviations.
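
As a rough illustration of that relationship, the sketch below computes per-gene (column) and per-sample (row) L2 norms on a made-up matrix. It does not reproduce `genie.get_cna_norms`, whose implementation is not shown here; treating missing values as 0 is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Made-up matrix: rows are samples, columns are genes.
toy_cna = pd.DataFrame(
    {
        "GENE_A": [0, 0, 0, -1],
        "GENE_B": [-1, 1, -1, 1],
        "GENE_C": [2, 0, np.nan, 0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Assumption for this sketch: a missing value contributes nothing to the norm.
filled = toy_cna.fillna(0)

gene_l2 = np.sqrt((filled**2).sum(axis=0))  # one norm per gene (column)
samp_l2 = np.sqrt((filled**2).sum(axis=1))  # one norm per sample (row)

print(gene_l2)
print(samp_l2)
```

`GENE_B` deviates by one copy in all four samples and `GENE_C` by two copies in a single sample, yet both have norm 2. The norm therefore reflects the magnitude of the deviations as well as how many there are.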

In [None]:
ax = np.log10(cna_gene_l2 + 1).hist(bins=30, figsize=(8,8))
ax.axvline(x=np.log10(cna_gene_l2["NF1"] + 1), ymin=0, ymax=1, color='red')
ax.axvline(x=np.log10(cna_gene_l2["NF2"] + 1), ymin=0, ymax=1, color='orange', ls='--')

### Distribution of Sample Vector Norms

Sample vector length is related to how many genes show copy number deviation from 0.

In [None]:
ax = np.log10(cna_samp_l2 + 1).hist(bins=30, figsize=(8,4))

In [None]:
ax = cna_samp_l2.hist(bins=50, figsize=(8,4))

### Melted CNA

In [None]:
df_cna_melted = genie.get_melted_cna(df_cna)

In [None]:
df_cna_melted

In [None]:
ser_cna_tokens = df_cna_melted.groupby('SAMPLE_ID').apply(lambda x: list(zip(x['hugo'], x['dcna'])))

In [None]:
ser_cna_tokens

In [None]:
ser_cna_tokens.apply(len).hist(bins=50)

# How Many Samples With Mutations + CNA data? 

In [None]:
df_var['has_CNA'] = df_var['SAMPLE_ID'].isin(df_cna.index)

In [None]:
df_var['has_CNA'].sum()

In [None]:
df_var.shape

In [None]:
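for gene in genie.NF_HUGO_SYMBOLS:
    print("Gene: ", gene)
    
    df = df_var[df_var['Hugo_Symbol'].isin([gene])]
    # Use a separate name so the CNA matrix `df_cna` loaded above is not overwritten
    df_with_cna = df[df['has_CNA']]
    # Each line prints: all variants, then only those from samples with CNA data
    print("  num_variants: ", df.shape[0], df_with_cna.shape[0])
    print("  num_samples:  ", df['SAMPLE_ID'].nunique(), df_with_cna['SAMPLE_ID'].nunique())
    print("  num_patients: ", df['PATIENT_ID'].nunique(), df_with_cna['PATIENT_ID'].nunique())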

In [None]:
for onco in genie.NF_ONCOTREE_CODES:
    print("Oncotree code: ", onco)
    
    df = df_var[df_var['ONCOTREE_CODE'].isin([onco])]
    # Use a separate name so the CNA matrix `df_cna` loaded above is not overwritten
    df_with_cna = df[df['has_CNA']]
    # Each line prints: all variants, then only those from samples with CNA data
    print("  num_variants: ", df.shape[0], df_with_cna.shape[0])
    print("  num_samples:  ", df['SAMPLE_ID'].nunique(), df_with_cna['SAMPLE_ID'].nunique())
    print("  num_patients: ", df['PATIENT_ID'].nunique(), df_with_cna['PATIENT_ID'].nunique())

# Data CNA seg (Segmented Copy Number Data) 

https://docs.cbioportal.org/file-formats/#segmented-data

https://cnvkit.readthedocs.io/en/stable/fileformats.html#seg

https://software.broadinstitute.org/software/igv/SEG

In [None]:
syn_file_paths['data_cna_hg19_seg']

In [None]:
df_seg = genie.read_cna_seg(syn_file_paths['data_cna_hg19_seg'])

In [None]:
df_seg

In [None]:
df_seg['ID'].isin(df_dcs['SAMPLE_ID']).all()