# Explore some data


# GENIE Datasets

### Synapse Links

* GENIE 13.3 consortium https://www.synapse.org/#!Synapse:syn36709873
  * data_clinical_patient https://www.synapse.org/#!Synapse:syn36710136
  * data_clinical_sample https://www.synapse.org/#!Synapse:syn36710137
  * data_mutations_extended https://www.synapse.org/#!Synapse:syn36710142
  * data_CNA https://www.synapse.org/#!Synapse:syn36710134
  * data_cna_hg19_seg https://www.synapse.org/#!Synapse:syn36710143
* GENIE 12.0 public https://www.synapse.org/#!Synapse:syn32309524
  * data_clinical_patient https://www.synapse.org/#!Synapse:syn32689054
  * data_clinical_sample https://www.synapse.org/#!Synapse:syn32689057
  * data_mutations_extended https://www.synapse.org/#!Synapse:syn32689317
  * data_CNA https://www.synapse.org/#!Synapse:syn32689019
  * data_cna_hg19_seg https://www.synapse.org/#!Synapse:syn32689379
  
### Data Format Explanations

* https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/
* https://docs.cbioportal.org/file-formats/#mutation-data
* https://docs.cbioportal.org/file-formats/#discrete-copy-number-data
* https://docs.cbioportal.org/file-formats/#segmented-data
  
# Challenge 3 FAQ Datasets

### Drug Screening of pNF Cell Lines

* wiki https://www.synapse.org/#!Synapse:syn4939906/wiki/235909

* Single Agent Screens https://www.synapse.org/#!Synapse:syn5522627
* 10x10 Combination Screens https://www.synapse.org/#!Synapse:syn5611797


### Synodos NF2

* wiki https://www.synapse.org/#!Synapse:syn2343195/wiki/62125

* SynodosNF2 DrugScreening Raw Processed Bundle https://www.synapse.org/#!Synapse:syn16815129


# NF Subtypes

https://hack4nf.slack.com/archives/C046RR7JFUL/p1665859968638679

```
Hi Lars, there is a predefined taxonomy of NF1 (e.g. MPNST, ANNUBP, pNF, cNF, JMML, low grade glioma), NF2 (e.g. schwannoma, meningioma), Schwannomatosis (e.g. schwannoma) tumor types.  Some tumors are connected too - eg pNF is thought to progress to ANNUBP which is thought to progress to MPNST, or there are subtypes of schwannoma - plexiform, melanotic, etc.
```

### NF1

Acronyms / subtypes

* MPNST: malignant peripheral nerve sheath tumors
* ANNUBP: atypical neurofibromatous neoplasms of uncertain biologic potential
* PNF: plexiform neurofibromas
* CNF: cutaneous and subcutaneous neurofibromas
* JMML: juvenile myelomonocytic leukemias
* GIST: gastrointestinal stromal tumors
* OPG: optic pathway gliomas
* SC: Schwann cell

Oncotree codes

* NST: Nerve Sheath Tumor
  * MPNST: Malignant Peripheral Nerve Sheath Tumor
  * NFIB: Neurofibroma
  * SCHW: Schwannoma
    * CSCHW: Cellular Schwannoma
    * MSCHW: Melanotic Schwannoma
 


## Inspiration

[The Consensus Molecular Subtypes of Colorectal Cancer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4636487/)

[Genotype-phenotype correlation in neurofibromatosis type-1: NF1 whole gene deletions lead to high tumor-burden and increased tumor-growth](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8099117/)

[Emerging therapeutic targets for neurofibromatosis type 1 (NF1)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7017752/)

[Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival](https://www.pnas.org/doi/10.1073/pnas.1102826108)

https://github.com/scikit-tda/kepler-mapper/tree/master/examples

https://kepler-mapper.scikit-tda.org/en/latest/generated/gallery/plot_breast_cancer.html

[Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery](https://www.frontiersin.org/articles/10.3389/fgene.2018.00682/full)

https://github.com/zeochoy/tcga-embedding/blob/master/train.py

In [None]:
from collections import Counter
import json
from pathlib import Path
import sys
from timeit import default_timer

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import plotly.express as px

In [None]:
from hack4nf.synapse import get_dataset
from hack4nf.genie_utils import (
    read_clinical_patient, 
    read_clinical_sample, 
    read_mutations_extended,
    read_cna,
    read_cna_seg,
    SYNIDS,
    dme_to_cravat,
    get_cna_norms,
    get_melted_cna,
)

In [None]:
#from IPython.display import display, HTML
#display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
pd.set_option('display.max_columns', 100)

In [None]:
genie_dataset_version = "genie_12.0"
#genie_dataset_version = "genie_13.3"

# GENIE - Data Clinical Patient

In [None]:
syn_file = get_dataset(SYNIDS[genie_dataset_version]['data_clinical_patient'])
df_dcp = read_clinical_patient(syn_file.path)

In [None]:
df_dcp

# GENIE - Data Clinical Sample

In [None]:
syn_file = get_dataset(SYNIDS[genie_dataset_version]['data_clinical_sample'])
df_dcs = read_clinical_sample(syn_file.path)

In [None]:
df_dcs

In [None]:
np.linspace(0.5,20.5,21)

### Samples per Patient

In [None]:
_ = (
    df_dcs.groupby('PATIENT_ID')
    .size()
    .sort_values()
    .hist(
        bins=np.linspace(0.5, 20.5, 21),
        log=True,
        figsize=(8,8),
    )
)

# GENIE - Data Mutations Extended

In [None]:
syn_file = get_dataset(SYNIDS[genie_dataset_version]['data_mutations_extended'])
df_dme = read_mutations_extended(syn_file.path)

In [None]:
df_dme

### Variants per Sample

In [None]:
_ = (
    df_dme.groupby('Tumor_Sample_Barcode')
    .size()
    .sort_values()
    .hist(
        bins=np.linspace(0.5, 80.5, 81), 
        log=True, 
        figsize=(8,8),
    )
)

### OpenCravat from GENIE

https://opencravat.org/

NOTE: GENIE uses the hg19/GRCh37 reference genome.

OpenCRAVAT input needs tab-separated values, for example:
```
CHROM   POS         STRAND  REF ALT INDIVIDUAL
chr17   31327718    +       C   T   40442
```
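
As a rough illustration of how mutation records could be reshaped into the layout above, here is a minimal sketch. It is not the implementation of `dme_to_cravat`; the function name `maf_to_cravat_sketch` is hypothetical, the input column names are the standard MAF ones (`Chromosome`, `Start_Position`, `Reference_Allele`, `Tumor_Seq_Allele2`, `Tumor_Sample_Barcode`), and the toy sample barcodes and second variant are made up. It assumes every variant is reported on the `+` strand.

```python
import pandas as pd


def maf_to_cravat_sketch(df_maf):
    """Build a CHROM/POS/STRAND/REF/ALT/INDIVIDUAL table from standard MAF columns.

    Hypothetical helper for illustration only; not the hack4nf implementation.
    """
    chrom = df_maf["Chromosome"].astype(str)
    # Add the "chr" prefix only where it is missing.
    chrom = chrom.where(chrom.str.startswith("chr"), "chr" + chrom)
    return pd.DataFrame(
        {
            "CHROM": chrom,
            "POS": df_maf["Start_Position"],
            "STRAND": "+",  # assumption: all variants on the forward strand
            "REF": df_maf["Reference_Allele"],
            "ALT": df_maf["Tumor_Seq_Allele2"],
            "INDIVIDUAL": df_maf["Tumor_Sample_Barcode"],
        }
    )


# Toy MAF-like input (hypothetical barcodes; second variant is made up).
toy_maf = pd.DataFrame(
    {
        "Chromosome": ["17", "chr22"],
        "Start_Position": [31327718, 29999545],
        "Reference_Allele": ["C", "G"],
        "Tumor_Seq_Allele2": ["T", "A"],
        "Tumor_Sample_Barcode": ["SAMPLE-1", "SAMPLE-2"],
    }
)

toy_cravat = maf_to_cravat_sketch(toy_maf)
# Serialize as tab-separated text, matching the layout shown above.
tsv_text = toy_cravat.to_csv(sep="\t", index=False)
print(tsv_text)
```

Whether the header row should be kept in the file, and how the hg19 assembly is declared to OpenCRAVAT, should be checked against the OpenCRAVAT documentation.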

In [None]:
df_cravat = dme_to_cravat(df_dme)

In [None]:
df_cravat

# GENIE - Merge Patient/Sample/Mutations

In [None]:
df_dme_mrg = pd.merge(
    df_dme, 
    df_dcs, 
    left_on="Tumor_Sample_Barcode", 
    right_on="SAMPLE_ID",
)
df_dme_mrg = pd.merge(
    df_dme_mrg, 
    df_dcp,
    on="PATIENT_ID",
)

In [None]:
df_dme_mrg

# GENIE - Data CNA (Discrete Copy Number Alteration Data)

https://docs.cbioportal.org/file-formats/#discrete-copy-number-data

For each gene-sample combination, a copy number level is specified:

* "-2" is a deep loss, possibly a homozygous deletion
* "-1" is a single-copy loss (heterozygous deletion)
* "0" is diploid
* "1" indicates a low-level gain
* "2" is a high-level amplification.
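
The level encoding above can be captured in a small lookup table, which is handy when labelling plots or summaries. The sketch below is illustrative only: `CNA_LEVEL_LABELS` and `label_cna_levels` are hypothetical names, the labels paraphrase the list above, and the toy values are made up.

```python
from collections import Counter

# Labels paraphrase the discrete copy-number levels listed above.
CNA_LEVEL_LABELS = {
    -2: "deep loss",
    -1: "single-copy loss",
    0: "diploid",
    1: "low-level gain",
    2: "high-level amplification",
}


def label_cna_levels(levels):
    """Map discrete CNA levels to labels, skipping missing values (None/NaN)."""
    labels = []
    for level in levels:
        # NaN is the only value that is not equal to itself.
        if level is None or level != level:
            continue
        labels.append(CNA_LEVEL_LABELS[int(level)])
    return labels


# Toy values for one gene across seven samples (made up for illustration).
toy_levels = [0, -2, 1, float("nan"), 2, 0, -1]
label_counts = Counter(label_cna_levels(toy_levels))
print(label_counts.most_common())
```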

In [None]:
syn_file = get_dataset(SYNIDS[genie_dataset_version]['data_CNA'])
df_cna = read_cna(syn_file.path)

In [None]:
df_cna

### Sparsity

In [None]:
all_cna_values = df_cna.values.flatten()
non_null_cna_values = all_cna_values[~np.isnan(all_cna_values)]
non_zero_cna_values = non_null_cna_values[non_null_cna_values!=0]
set(non_zero_cna_values)

In [None]:
Counter(non_null_cna_values).most_common()

In [None]:
print('num_samp x num_gene: ', df_cna.shape)
print('total values: ', df_cna.size)
print('non null values: ', len(non_null_cna_values), len(non_null_cna_values)/df_cna.size)
print('non zero_values: ', len(non_zero_cna_values), len(non_zero_cna_values)/df_cna.size)

### Distribution of non-null / non-zero values

In [None]:
_ = plt.hist(non_zero_cna_values, bins=np.linspace(-2.25, 2.25, 10))

### Gene and Sample L2 Norms

In [None]:
cna_gene_l2 = get_cna_norms(df_cna, axis=0).sort_values()
cna_samp_l2 = get_cna_norms(df_cna, axis=1).sort_values()

In [None]:
cna_gene_l2.tail(30)

In [None]:
cna_gene_l2["NF1"], cna_gene_l2["NF2"], cna_gene_l2["SMARCB1"], cna_gene_l2["LZTR1"]

In [None]:
np.log10(cna_gene_l2["NF1"])

### Distribution of Gene Vector Norms

Gene vector length is related to how many samples show copy number deviation from 0. 
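
To make the relationship concrete, here is a toy example of L2 norms on a small samples-by-genes matrix. It is a sketch, not the implementation of `get_cna_norms`: the helper `l2_norms` is hypothetical, the values are made up, and it assumes missing values are treated as 0.

```python
import numpy as np
import pandas as pd

# Toy discrete CNA matrix: rows are samples, columns are genes (made-up values).
toy_cna = pd.DataFrame(
    {
        "GENE_A": [0, -2, 2, np.nan],
        "GENE_B": [0, 0, 1, 0],
        "GENE_C": [0, 0, 0, 0],
    },
    index=["S1", "S2", "S3", "S4"],
)


def l2_norms(df, axis=0):
    """Euclidean norm along `axis`, treating missing values as 0 (an assumption)."""
    return np.sqrt((df.fillna(0) ** 2).sum(axis=axis))


gene_norms = l2_norms(toy_cna, axis=0)  # one value per gene (column)
samp_norms = l2_norms(toy_cna, axis=1)  # one value per sample (row)
print(gene_norms)
print(samp_norms)
```

A gene that is diploid in every sample (`GENE_C`) has norm 0, while a gene altered in more samples, or with larger-magnitude levels, has a longer vector. Because levels of ±2 contribute four times as much as ±1 to the squared norm, the norm reflects both how many samples deviate from 0 and by how much.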

In [None]:
ax = np.log10(cna_gene_l2 + 1).hist(bins=30, figsize=(8,8))
ax.axvline(x=np.log10(cna_gene_l2["NF1"] + 1), ymin=0, ymax=1, color='red')
ax.axvline(x=np.log10(cna_gene_l2["NF2"] + 1), ymin=0, ymax=1, color='orange', ls='--')

### Distribution of Sample Vector Norms

Sample vector length is related to how many genes show copy number deviation from 0.

In [None]:
ax = np.log10(cna_samp_l2 + 1).hist(bins=30, figsize=(8,4))

In [None]:
ax = cna_samp_l2.hist(bins=50, figsize=(8,4))

### Melted CNA

In [None]:
df_cna_melted = get_melted_cna(df_cna)

In [None]:
df_cna_melted

In [None]:
df_cna_tokens = df_cna_melted.groupby('SAMPLE_ID').apply(lambda x: list(zip(x['hugo'], x['dcna'])))

In [None]:
df_cna_tokens

# Data CNA seg (Segmented Copy Number Data) 

https://docs.cbioportal.org/file-formats/#segmented-data

https://cnvkit.readthedocs.io/en/stable/fileformats.html#seg

https://software.broadinstitute.org/software/igv/SEG

In [None]:
syn_file = get_dataset(SYNIDS[genie_dataset_version]['data_cna_hg19_seg'])
df_seg = read_cna_seg(syn_file.path)

In [None]:
df_seg

In [None]:
df_seg['ID'].isin(df_dcs['SAMPLE_ID']).all()

# Drug Screens pNF

In [None]:
synfile = get_dataset("syn5522642")

In [None]:
synfile.path

In [None]:
df_hts = pd.read_csv(synfile.path)

In [None]:
df_hts

In [None]:
plt_data = {
    "NCGC protocol": [],
    "NCGC SID": [],
    "Cell line": [],
    "C": [],
    "D": [],
    'name': [],
    'target': [],
}
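# Reshape wide to long: one output row per (screen row, concentration) pair,
# reading columns C0..C10 and DATA0..DATA10 from each screening row.
for _, row in df_hts.iterrows():
    for i in range(11):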
        plt_data['NCGC protocol'].append(row["NCGC protocol"])
        plt_data['NCGC SID'].append(row["NCGC SID"])
        plt_data["Cell line"].append(row["Cell line"])
        plt_data['C'].append(row[f"C{i}"])
        plt_data['D'].append(row[f"DATA{i}"])
        plt_data['name'].append(row['name'])
        plt_data['target'].append(row['target'])

In [None]:
row

In [None]:
df_plt = pd.DataFrame(plt_data).sort_values(["NCGC SID", "C"])
df_plt.head(40)

In [None]:
fig = px.line(
    df_plt.head(11*4), 
    x="C", 
    y="D", 
    color="NCGC SID", 
    log_x=True,
    hover_data=['name', 'target', "NCGC protocol", "Cell line"],
)
fig.update_layout(showlegend=False)
fig.update_traces(line=dict(width=0.5))
fig.show()