# Introduction to cBioPortal REST API

This [Jupyter notebook](https://jupyter.org/) gives examples of how to use the [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) web service from [cBioPortal](https://www.cbioportal.org) as well as other web services from the [Knowledge Systems Group at MSKCC](https://www.mskcc.org/research-areas/labs/nikolaus-schultz). We will pull data from those APIs to make visualizations.

## How to run the notebook

This notebook can be executed on your own machine after installing Jupyter. Please install the Python 3 version of Anaconda: https://www.anaconda.com/download/. Once that is set up, you can install Jupyter with:

```bash
conda install jupyter
```

For these examples we also require the [Swagger API](https://swagger.io/specification/) client `bravado`.

```bash
conda install -c conda-forge bravado
```

And the popular data analysis libraries pandas, matplotlib and seaborn:

```bash
conda install pandas matplotlib seaborn
```

Then clone this repo:

```bash
git clone https://github.com/cbioportal/workbench
```

And run Jupyter in this folder:

```bash
cd workbench/intro
jupyter notebook
```

That should open Jupyter in a new browser window, where you can open this notebook using the web interface. You can then follow along with the next steps.
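
Before continuing, it can help to confirm that the packages installed above are visible to the Python environment running the notebook. The following is a minimal sketch using only the standard library; the `required` list simply mirrors the install commands in this section and is an assumption about what the rest of the notebook imports.

```python
import importlib.util

# Packages installed in the steps above; extend the list if you add more.
required = ["bravado", "pandas", "matplotlib", "seaborn"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported as missing, re-run the matching `conda install` command in the same environment that launches Jupyter.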

## How to use the notebook

The notebook consists of cells, each of which can be executed by clicking on it and pressing Shift+Enter. A dropdown in the toolbar at the top indicates the type of the selected cell, e.g. `Code` or [Markdown](https://en.wikipedia.org/wiki/Markdown). A code cell is executed as raw Python code; a Markdown cell is written in a markup language and run through a Markdown parser. Both generate HTML that is printed directly to the notebook page.

There are a few keyboard shortcuts that are good to know: `b` creates a new cell below the selected one and `a` creates one above it. A code cell can be edited with a single click and a Markdown cell with a double click. A complete list of keyboard shortcuts can be found by pressing the keyboard icon in the toolbar at the top.

Give it a shot by editing one of the cells and pressing Shift+Enter.

## Using the REST APIs

All [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) web services from the [Knowledge Systems Group](https://www.mskcc.org/research-areas/labs/nikolaus-schultz) we will be using in this tutorial have their REST APIs defined following the [Open API / Swagger specification](https://swagger.io/specification/). This allows us to use `bravado` to connect to them directly, and explore the API interactively.

For example, this is how to connect to the [cBioPortal](https://www.cbioportal.org) API:

In [27]:
from bravado.client import SwaggerClient

cbioportal = SwaggerClient.from_url('https://www.cbioportal.org/api/v2/api-docs',
                                config={"validate_requests":False,
                                        "validate_responses":False,
                                       "validate_swagger_spec": False})
print(cbioportal)

SwaggerClient(https://www.cbioportal.org/api)


In [28]:
%config Completer.use_jedi = False

You can now explore the API using code completion: press `Tab` after typing `cbioportal.` to see the available resources.
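
Tab completion relies on `dir()`, so the same listing can be produced programmatically. The helper below is a minimal sketch, not part of `bravado`: `list_operations` is a hypothetical name, and it assumes that `dir()` on the client yields resource names and `dir()` on a resource yields operation names, as completion suggests. It is demonstrated on an offline stand-in object so that it runs without a network connection.

```python
from types import SimpleNamespace


def list_operations(client):
    """Map each public resource name on `client` to its public operation names."""
    return {
        resource: [
            op for op in dir(getattr(client, resource)) if not op.startswith("_")
        ]
        for resource in dir(client)
        if not resource.startswith("_")
    }


# Offline stand-in shaped like the client (for illustration only); it carries
# just the two operations used elsewhere in this notebook.
stub = SimpleNamespace(
    Studies=SimpleNamespace(getAllStudiesUsingGET=None),
    Cancer_Types=SimpleNamespace(getAllCancerTypesUsingGET=None),
)

operations = list_operations(stub)
for resource, ops in operations.items():
    print(resource, ops)
```

With a live connection, calling `list_operations(cbioportal)` should give a similar mapping for the real API, provided the assumption about `dir()` holds.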

In [29]:
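# Example parameter values (defined here but not used by the calls below)
cancerTypeId = 'wt'
sortBy = 'studyId'
keyword = 'thyroid'
studyId = 'thca_tcga'

# Fetch all cancer types and all studies
cancer_types = cbioportal.Cancer_Types.getAllCancerTypesUsingGET().result()
studies = cbioportal.Studies.getAllStudiesUsingGET().result()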

In [30]:
studies=cbioportal.Studies.getAllStudiesUsingGET(sortBy='importDate',projection='SUMMARY').result()

print("There are {} studies to look at overall, and each must be separated by year.".format(
    len(studies)
))

There are 418 studies to look at overall, and each must be separated by year.


In [31]:
cbioportal.Studies.getAllStudiesUsingGET(projection="SUMMARY").result()

[CancerStudy(allSampleCount=12, cancerType=None, cancerTypeId='acbc', citation='Martelotto et al. J Pathol 2015', cnaSampleCount=None, completeSampleCount=None, description='Whole exome sequencing of 12 breast AdCCs.', groups='ACYC;PUBLIC', importDate='2023-12-06 19:10:36', massSpectrometrySampleCount=None, methylationHm27SampleCount=None, miRnaSampleCount=None, mrnaMicroarraySampleCount=None, mrnaRnaSeqSampleCount=None, mrnaRnaSeqV2SampleCount=None, name='Adenoid Cystic Carcinoma of the Breast (MSK, J Pathol. 2015)', pmid='26095796', publicStudy=True, readPermission=True, referenceGenome='hg19', rppaSampleCount=None, sequencedSampleCount=None, status=0, studyId='acbc_mskcc_2015', treatmentCount=None),
 CancerStudy(allSampleCount=60, cancerType=None, cancerTypeId='acyc', citation='Ho et al. Nat Genet 2013', cnaSampleCount=None, completeSampleCount=None, description='Whole-exome or whole-genome sequencing analysis of 60 ACC tumor/normal pairs', groups='ACYC;PUBLIC', importDate='2023-12-

In [32]:
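studies = cbioportal.Studies.getAllStudiesUsingGET().result()
# `studies` is a plain Python list, so dir(studies) lists the attributes of
# the list itself; use dir(studies[0]) to see the fields of a single study.
{k: getattr(studies, k) for k in dir(studies)}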

{'__add__': <method-wrapper '__add__' of list object at 0x166364640>,
 '__class__': list,
 '__class_getitem__': <function list.__class_getitem__>,
 '__contains__': <method-wrapper '__contains__' of list object at 0x166364640>,
 '__delattr__': <method-wrapper '__delattr__' of list object at 0x166364640>,
 '__delitem__': <method-wrapper '__delitem__' of list object at 0x166364640>,
 '__dir__': <function list.__dir__()>,
 '__doc__': 'Built-in mutable sequence.\n\nIf no argument is given, the constructor creates a new empty list.\nThe argument must be an iterable if specified.',
 '__eq__': <method-wrapper '__eq__' of list object at 0x166364640>,
 '__format__': <function list.__format__(format_spec, /)>,
 '__ge__': <method-wrapper '__ge__' of list object at 0x166364640>,
 '__getattribute__': <method-wrapper '__getattribute__' of list object at 0x166364640>,
 '__getitem__': <function list.__getitem__(index, /)>,
 '__getstate__': <function list.__getstate__()>,
 '__gt__': <method-wrapper '__g

In [33]:
import pandas as pd

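# Build one row per study from the attributes of each CancerStudy model.
# Note: s.studyId is a plain str, so the second dict also adds every str
# attribute (e.g. 'strip', 'upper') as a column, which is why df.columns
# ends with string method names. Drop the second dict to keep only study fields.
df = pd.DataFrame.from_dict([
    dict(
        {k: getattr(s, k) for k in dir(s)},
        **{k: getattr(s.studyId, k) for k in dir(s.studyId)})
    for s in studies
])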

In [34]:
df.columns

Index(['allSampleCount', 'cancerType', 'cancerTypeId', 'citation',
       'cnaSampleCount', 'completeSampleCount', 'description', 'groups',
       'importDate', 'massSpectrometrySampleCount',
       ...
       'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase',
       'title', 'translate', 'upper', 'zfill'],
      dtype='object', length=106)
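
The 106 columns above include string method names such as `strip` and `upper`, because the attributes of the `studyId` string were merged into each row. A minimal sketch of a narrower conversion follows. `models_to_frame` is a hypothetical helper, and it assumes that the public names returned by `dir()` on a model are exactly its data fields. It is shown on offline stand-ins that reuse values from the output above.

```python
from types import SimpleNamespace

import pandas as pd


def models_to_frame(models):
    """Build a DataFrame with one row per model and one column per public attribute."""
    rows = [
        {k: getattr(m, k) for k in dir(m) if not k.startswith("_")}
        for m in models
    ]
    return pd.DataFrame(rows)


# Offline stand-ins carrying a few of the CancerStudy fields (illustration only).
sample = [
    SimpleNamespace(studyId="acbc_mskcc_2015", allSampleCount=12,
                    citation="Martelotto et al. J Pathol 2015"),
    SimpleNamespace(studyId="acyc_mskcc_2013", allSampleCount=60,
                    citation="Ho et al. Nat Genet 2013"),
]

study_df = models_to_frame(sample)
print(study_df.columns.tolist())
```

Applied to the real `studies` list, this should yield only the study fields, without the string methods.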

In [35]:
df['name citation studyId allSampleCount'.split()].head()

| | name | citation | studyId | allSampleCount |
|---|---|---|---|---|
| 0 | Adenoid Cystic Carcinoma of the Breast (MSK, J... | Martelotto et al. J Pathol 2015 | acbc_mskcc_2015 | 12 |
| 1 | Adenoid Cystic Carcinoma (MSK, Nat Genet 2013) | Ho et al. Nat Genet 2013 | acyc_mskcc_2013 | 60 |
| 2 | Adenoid Cystic Carcinoma (FMI, Am J Surg Pathl... | Ross et al. Am J Surg Pathl 2014 | acyc_fmi_2014 | 28 |
| 3 | Adenoid Cystic Carcinoma (JHU, Cancer Prev Res... | Rettig et al, Cancer Prev Res 2016 | acyc_jhu_2016 | 25 |
| 4 | Adenoid Cystic Carcinoma (MDA, Clin Cancer Res... | Mitani et al. Clin Cancer Res 2015 | acyc_mda_2015 | 102 |


In [42]:
import pandas as pd

# Sample DataFrame: the first five citations from the studies table above,
# plus a missing value to exercise the edge case. This matches the output
# printed below; the full set of study names is available from the API
# via the `name` field of each study.
data = {
    "citation": [
        'Martelotto et al. J Pathol 2015',
        'Ho et al. Nat Genet 2013',
        'Ross et al. Am J Surg Pathl 2014',
        'Rettig et al, Cancer Prev Res 2016',
        'Mitani et al. Clin Cancer Res 2015',
        None,  # Test case with missing value
    ]
}

df = pd.DataFrame(data)

# Create the new column. Note that astype(str) turns a missing citation into the string 'None', so the trailing .fillna('') has nothing left to fill.
df['year of publication'] = df['citation'].astype(str).str.split('20').str[-1].fillna('')

# Print the resulting DataFrame
print(df)

# Assertion for testing
assert df['year of publication'].tolist() == ['15','13','14','16','15', 'None']

                             citation year of publication
0     Martelotto et al. J Pathol 2015                  15
1            Ho et al. Nat Genet 2013                  13
2    Ross et al. Am J Surg Pathl 2014                  14
3  Rettig et al, Cancer Prev Res 2016                  16
4  Mitani et al. Clin Cancer Res 2015                  15
5                                None                None
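
Splitting on `'20'` keeps only the last two digits of the year and would misfire on any citation that contains `20` somewhere else. A minimal alternative sketch, assuming each citation that has a year carries it as a standalone four-digit number starting with 19 or 20, uses a regular expression instead. The `citations` series below is a small stand-in built from the rows shown above, not the full dataset:

```python
import pandas as pd

# Stand-in data copied from the example rows above, plus one missing value.
citations = pd.Series([
    'Martelotto et al. J Pathol 2015',
    'Ho et al. Nat Genet 2013',
    None,
])

# Capture a standalone four-digit year (19xx or 20xx); rows without one become NaN.
year_text = citations.str.extract(r'\b((?:19|20)\d{2})\b', expand=False)

# Convert to a nullable integer column so missing years stay missing
# instead of turning into the string 'None'.
years = pd.to_numeric(year_text).astype('Int64')
print(years)
```

Keeping the year as a number also means a later `sort_values` orders the rows numerically rather than alphabetically.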


In [48]:
df_ascending = df.sort_values(by='year of publication', ascending=True, kind='stable')
print(df_ascending)

                             citation year of publication
1            Ho et al. Nat Genet 2013                  13
2    Ross et al. Am J Surg Pathl 2014                  14
0     Martelotto et al. J Pathol 2015                  15
4  Mitani et al. Clin Cancer Res 2015                  15
3  Rettig et al, Cancer Prev Res 2016                  16
5                                None                None


In [None]:
import requests

headers = {
    'Accept': 'application/json'  # ask the server for a JSON response
}

studies_url = 'https://www.cbioportal.org/api/studies'

# Make GET request to fetch study data
response = requests.get(studies_url, headers=headers)
    
# Check if request was successful
if response.status_code == 200:
    # Convert response to JSON
    studies_data = response.json()
    
    # Initialize a list to store study names
    study_names = []
    
    # Extract study names from each study object
    for study in studies_data:
        if 'name' in study:
            study_names.append(study['name'])
    
    # Print or use study names as needed
    print("Study Names:", study_names)
        
else:
    print("Error:", response.status_code, response.text)



Study Names: ['Adenoid Cystic Carcinoma of the Breast (MSK, J Pathol. 2015)', 'Adenoid Cystic Carcinoma (MSK, Nat Genet 2013)', 'Adenoid Cystic Carcinoma (FMI, Am J Surg Pathl. 2014)', 'Adenoid Cystic Carcinoma (JHU, Cancer Prev Res 2016)', 'Adenoid Cystic Carcinoma (MDA, Clin Cancer Res 2015)', 'Adenoid Cystic Carcinoma (MGH, Nat Gen 2016)', 'Adenoid Cystic Carcinoma (Sanger/MDA, JCI 2013)', 'Adenoid Cystic Carcinoma Project (J Clin Invest 2019)', 'Basal Cell Carcinoma (UNIGE, Nat Genet 2016)', 'Acute Lymphoblastic Leukemia (St Jude, Nat Genet 2015)', 'Ampullary Carcinoma (Baylor College of Medicine, Cell Reports 2016)', 'Hypodiploid Acute Lymphoid Leukemia (St Jude, Nat Genet 2013)', 'Acute Lymphoblastic Leukemia (St Jude, Nat Genet 2016)', 'The Angiosarcoma Project - Count Me In (Nature Medicine, 2020)', 'Breast Fibroepithelial Tumors (Duke-NUS, Nat Genet 2015)', 'Acute Myeloid Leukemia (OHSU, Nature 2018)', 'Appendiceal Cancer (MSK, J Clin Oncol 2022)', 'Metastatic Biliary Tract Ca

In [None]:
import requests

headers = {
    'Accept': 'application/json'  # ask the server for a JSON response
}

studies_url = 'https://www.cbioportal.org/api/studies'

# Make GET request to fetch study data
response = requests.get(studies_url, headers=headers)
    
# Check if request was successful
if response.status_code == 200:
    # Convert response to JSON
    studies_data = response.json()
    
    # Initialize a list to store study years
    study_years = []
    
    # Extract study years from each study object
    for study in studies_data:
        if 'name' in study:
            # Extract the year from the study name, which usually ends in e.g. "... 2015)".
            # Splitting on '20' keeps only the text after the last '20', so the result still
            # carries the closing parenthesis ('15)') and names without '20' come back whole.
            year = study['name'].split('20')[-1]
            study_years.append(year)
    
    # Print or use study years as needed
    print("Study Years:", study_years)
        
else:
    print("Error:", response.status_code, response.text)


Study Years: ['15)', '13)', '14)', '16)', '15)', '16)', '13)', '19)', '16)', '15)', '16)', '13)', '16)', ')', '15)', '18)', '22)', '22)', '22)', '14)', '13)', '13)', '14)', '17)', '16)', '12)', '14)', ')', '22)', '16)', '19)', '15)', '22)', ')', ')', '22)', '12)', '12)', '12)', '14)', '13)', '12)', '18)', '19)', ')', '18)', '19)', '22)', '13)', '13)', '12)', '15)', '15)', '11)', '15)', '18)', '17)', 'Colorectal Adenocarcinoma (TCGA, Firehose Legacy)', '12)', '14)', '15)', '16)', '14)', '19)', '21)', ')', ')', '22)', '22)', '15)', '15)', '15)', '12)', '17)', '17)', '18)', ')', '21)', '22)', '14)', '13)', '14)', '14)', '14)', '14)', '08)', '13)', '16)', '18)', ')', ')', '19)', '21)', '22)', '16)', '15)', '18)', '18)', '19)', '19)', '19)', ')', '19)', '21)', '22)', '22)', '23)', '12)', '13)', '14)', '11)', '11)', '15)', '13)', '13)', '14)', '21)', '15)', '21)', '14)', '16)', '14)', '12)', '15)', '12)', '14)', '08)', '14)', '17)', ')', '23)', '21)', '21)', ')', '22)', '12)', '12)', '12)', 

In [None]:
import collections
years = ['15)', '13)', '14)', '16)', '15)', '16)', '13)', '19)', '16)', '15)', '16)', '13)', '16)', ')', '15)', '18)', '22)', '22)', '22)', '14)', '13)', '13)', '14)', '17)', '16)', '12)', '14)', ')', '22)', '16)', '19)', '15)', '22)', ')', ')', '22)', '12)', '12)', '12)', '14)', '13)', '12)', '18)', '19)', ')', '18)', '19)', '22)', '13)', '13)', '12)', '15)', '15)', '11)', '15)', '18)', '17)', 'Colorectal Adenocarcinoma (TCGA, Firehose Legacy)', '12)', '14)', '15)', '16)', '14)', '19)', '21)', ')', ')', '22)', '22)', '15)', '15)', '15)', '12)', '17)', '17)', '18)', ')', '21)', '22)', '14)', '13)', '14)', '14)', '14)', '14)', '08)', '13)', '16)', '18)', ')', ')', '19)', '21)', '22)', '16)', '15)', '18)', '18)', '19)', '19)', '19)', ')', '19)', '21)', '22)', '22)', '23)', '12)', '13)', '14)', '11)', '11)', '15)', '13)', '13)', '14)', '21)', '15)', '21)', '14)', '16)', '14)', '12)', '15)', '12)', '14)', '08)', '14)', '17)', ')', '23)', '21)', '21)', ')', '22)', '12)', '12)', '12)', '12)', '13)', '11)', '16)', '16)', 'Thoracic PDX (MSK, Provisional)', ')', ')', '16)', '21)', '21)', '21)', '17)', '21)', '21)', '22)', '14)', '14)', '12)', '16)', '17)', '16)', '18)', '13)', '16)', '17)', '22)', '21)', ')', '21)', '21)', '21)', '22)', '22)', '22)', '14)', '14)', '16)', '15)', '16)', '13)', '11)', '13)', '17)', '17)', '17)', '18)', '18)', '15)', '18)', '22)', '23)', '11)', '14)', '12)', '15)', '11)', '15)', '13)', '15)', '16)', '11)', '17)', '13)', '17)', '17)', ')', '22)', '21)', '22)', '23)', '16)', '12)', '13)', '12)', '10)', '14)', '15)', '15)', '17)', '14)', ')', ')', '21)', '22)', '22)', 'Skin Cutaneous Melanoma (TCGA, Firehose Legacy)', '12)', '12)', '12)', '12)', '14)', '14)', '10)', '14)', '17)', '14)', '17)', '15)', '15)', '15)', '22)', '21)', '22)', '12)', '14)', '14)', '11)', '14)', '14)', '16)', '14)', '17)', '17)', '17)', '17)', '18)', '18)', '19)', '19)', ')', ')', '22)', '14)', '13)', '16)', '16)', '15)', '18)', '18)', '17)', '18)', '19)', ')', ')', ')', 
'21)', ')', '22)', '12)', '19)', '23)', '18)', '23)', ')', '23)', '23)', '23)', '23)', '23)', '23)', '23)', '23)', '23)', '23)', '23)', 'Adrenocortical Carcinoma (TCGA, Firehose Legacy)', 'Bladder Urothelial Carcinoma (TCGA, Firehose Legacy)', '15)', ')', '18)', '18)', '17)', '16)', '22)', ')', '21)', 'Breast Invasive Carcinoma (TCGA, Firehose Legacy)', 'Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (TCGA, Firehose Legacy)', 'Cholangiocarcinoma (TCGA, Firehose Legacy)', 'Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (TCGA, Firehose Legacy)', 'Esophageal Carcinoma (TCGA, Firehose Legacy)', 'Glioblastoma Multiforme (TCGA, Firehose Legacy)', 'Head and Neck Squamous Cell Carcinoma (TCGA, Firehose Legacy)', '15)', '12)', '19)', '23)', '24)', '22)', '23)', 'Acute Myeloid Leukemia (TCGA, Firehose Legacy)', 'Kidney Renal Clear Cell Carcinoma (TCGA, Firehose Legacy)', 'Kidney Chromophobe (TCGA, Firehose Legacy)', 'Brain Lower Grade Glioma (TCGA, Firehose Legacy)', 'Liver Hepatocellular Carcinoma (TCGA, Firehose Legacy)', 'Lung Adenocarcinoma (TCGA, Firehose Legacy)', 'Lung Squamous Cell Carcinoma (TCGA, Firehose Legacy)', 'Mesothelioma (TCGA, Firehose Legacy)', 'Ovarian Serous Cystadenocarcinoma (TCGA, Firehose Legacy)', 'Kidney Renal Papillary Cell Carcinoma (TCGA, Firehose Legacy)', '13)', '17)', 'Mature B-cell malignancies (MD Anderson Cancer Center)', '19)', '21)', 'Pancreatic Adenocarcinoma (TCGA, Firehose Legacy)', 'Pheochromocytoma and Paraganglioma (TCGA, Firehose Legacy)', 'Prostate Adenocarcinoma (TCGA, Firehose Legacy)', 'Sarcoma (TCGA, Firehose Legacy)', 'Stomach Adenocarcinoma (TCGA, Firehose Legacy)', 'Testicular Germ Cell Cancer (TCGA, Firehose Legacy)', 'Thymoma (TCGA, Firehose Legacy)', 'Thyroid Carcinoma (TCGA, Firehose Legacy)', 'Uterine Corpus Endometrial Carcinoma (TCGA, Firehose Legacy)', 'Uterine Carcinosarcoma (TCGA, Firehose Legacy)', 'Uveal Melanoma (TCGA, Firehose Legacy)', '18)', '18)', '19)', '18)', ')', '24)', '18)', '24)', 
'BRAF Fusions - IMPACT Clinical Sequencing Cohort (MSKCC)', '16)', 'BRAF Fusions - ARCHER Clinical Sequencing Cohort (MSKCC)', 'Diffuse Glioma (GLASS Consortium)', '24)', ')', '23)', '21)', '24)', '24)', '24)', '22)', '24)', '22)', '22)', '24)', '22)', ')', '24)', '22)', '22)', '23)', '19)', '22)', '23)', ')', '21)', '23)', '23)', '24)', 'Adrenocortical Carcinoma (TCGA, PanCancer Atlas)', 'Bladder Urothelial Carcinoma (TCGA, PanCancer Atlas)', 'Breast Invasive Carcinoma (TCGA, PanCancer Atlas)', 'Cervical Squamous Cell Carcinoma (TCGA, PanCancer Atlas)', 'Cholangiocarcinoma (TCGA, PanCancer Atlas)', 'Diffuse Large B-Cell Lymphoma (TCGA, PanCancer Atlas)', 'Esophageal Adenocarcinoma (TCGA, PanCancer Atlas)', 'Glioblastoma Multiforme (TCGA, PanCancer Atlas)', 'Head and Neck Squamous Cell Carcinoma (TCGA, PanCancer Atlas)', ')', 'Colorectal Adenocarcinoma (TCGA, PanCancer Atlas)', 'Kidney Chromophobe (TCGA, PanCancer Atlas)', 'Kidney Renal Clear Cell Carcinoma (TCGA, PanCancer Atlas)', 'Kidney Renal Papillary Cell Carcinoma (TCGA, PanCancer Atlas)', 'Acute Myeloid Leukemia (TCGA, PanCancer Atlas)', 'Brain Lower Grade Glioma (TCGA, PanCancer Atlas)', 'Liver Hepatocellular Carcinoma (TCGA, PanCancer Atlas)', 'Lung Adenocarcinoma (TCGA, PanCancer Atlas)', 'Lung Squamous Cell Carcinoma (TCGA, PanCancer Atlas)', 'Mesothelioma (TCGA, PanCancer Atlas)', 'Ovarian Serous Cystadenocarcinoma (TCGA, PanCancer Atlas)', 'Pancreatic Adenocarcinoma (TCGA, PanCancer Atlas)', 'Pheochromocytoma and Paraganglioma (TCGA, PanCancer Atlas)', 'Prostate Adenocarcinoma (TCGA, PanCancer Atlas)', 'Sarcoma (TCGA, PanCancer Atlas)', 'Skin Cutaneous Melanoma (TCGA, PanCancer Atlas)', 'Stomach Adenocarcinoma (TCGA, PanCancer Atlas)', 'Testicular Germ Cell Tumors (TCGA, PanCancer Atlas)', 'Thyroid Carcinoma (TCGA, PanCancer Atlas)', 'Thymoma (TCGA, PanCancer Atlas)', 'Uterine Corpus Endometrial Carcinoma (TCGA, PanCancer Atlas)', 'Uterine Carcinosarcoma (TCGA, PanCancer Atlas)', 
'Uveal Melanoma (TCGA, PanCancer Atlas)', '24)', '24)']
counter = collections.Counter(years)
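
The hard-coded list above still contains closing parentheses and whole study names wherever no year was found. A minimal sketch of a cleaner count, assuming the year (when present) is the four-digit number directly before the final `)` of the study name, is shown below. The helper `extract_year` and the short `study_names` list are illustrative stand-ins, with names copied from the output above:

```python
import re
from collections import Counter

# Stand-in sample of study names taken from the output above.
study_names = [
    'Adenoid Cystic Carcinoma of the Breast (MSK, J Pathol. 2015)',
    'Adenoid Cystic Carcinoma (MSK, Nat Genet 2013)',
    'Adenoid Cystic Carcinoma (MDA, Clin Cancer Res 2015)',
    'Colorectal Adenocarcinoma (TCGA, Firehose Legacy)',
]

# A four-digit year (19xx or 20xx) immediately followed by the closing parenthesis.
YEAR_AT_END = re.compile(r'((?:19|20)\d{2})\)\s*$')

def extract_year(name):
    """Return the publication year as an int, or None when the name has none."""
    match = YEAR_AT_END.search(name)
    return int(match.group(1)) if match else None

# Count studies per year; studies without a year are grouped under None.
year_counts = Counter(extract_year(name) for name in study_names)
print(year_counts.most_common())
```

With the real data, `study_names` would be the list built from the `/api/studies` response in the earlier cell.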



Pressing tab after typing `cbioportal.` gives a dropdown with all the different APIs, similar to how you can see them on the cBioPortal website: https://www.cbioportal.org/api/swagger-ui.html#/. If you don't see the dropdown, some people have found that running `%config Completer.use_jedi = False` fixes the problem.

You can also get the parameters to a specific endpoint by pressing shift+tab twice after typing the name of the specific endpoint e.g.:

In [None]:
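cancer_types_by_parent = cbioportal.Cancer_Types.getAllCancerTypesUsingGET(sortBy='parent').result()
print(cancer_types_by_parent[:3])

Doing the same for the `getCancerTypeUsingGET` endpoint shows that one of its parameters is `cancerTypeId` of type `string`, with `acc` given as an example:

In [None]:
acc = cbioportal.Cancer_Types.getCancerTypeUsingGET(cancerTypeId='acc').result()
print(acc)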

You can see that the JSON output returned by the cBioPortal API gets automatically converted into an object called `TypeOfCancer`. This object can be explored interactively as well by pressing tab after typing `acc.`:

In [None]:
acc.parent

### cBioPortal API

[cBioPortal](https://www.cbioportal.org) stores cancer genomics data from a large number of published studies. Let's figure out:

- how many studies are there?
- how many cancer types do they span?
- how many samples in total?
- which study has the largest number of samples?

In [None]:
studies = cbioportal.Studies.getAllStudiesUsingGET().result()
cancer_types = cbioportal.Cancer_Types.getAllCancerTypesUsingGET().result()

print("In total there are {} studies in cBioPortal, spanning {} different types of cancer.".format(
    len(studies),
    len(cancer_types)
))

To get the total number of samples in each study we have to look a bit more at the response of the studies endpoint:

In [None]:
studies[0]

In [None]:
dir(studies[-2])

We can sum the `allSampleCount` values of each study in cBioPortal:

In [None]:
print("The total number of samples in all studies is: {}".format(sum([x.allSampleCount for x in studies])))

Let's see which study has the largest number of samples:

In [None]:
sorted_studies = sorted(studies, key=lambda x: x.allSampleCount)
sorted_studies[-1]

CancerStudy(allSampleCount=42714, cancerType=None, cancerTypeId='mixed', citation='Stonestrom et al. Blood Adv 2024', cnaSampleCount=None, completeSampleCount=None, description='Targeted sequencing of 47,532 patient samples with mixed tumor types and their matched normals to identify clonal hematopoiesis mutations using MSK-IMPACT.', groups='', importDate='2024-06-28 04:07:35', massSpectrometrySampleCount=None, methylationHm27SampleCount=None, miRnaSampleCount=None, mrnaMicroarraySampleCount=None, mrnaRnaSeqSampleCount=None, mrnaRnaSeqV2SampleCount=None, name='Cancer Therapy and Clonal Hematopoiesis (MSK, Blood Adv 2023)', pmid='38147626', publicStudy=True, readPermission=True, referenceGenome='hg19', rppaSampleCount=None, sequencedSampleCount=None, status=0, studyId='msk_ch_2023', treatmentCount=None)

We can also easily see e.g. the top 3 largest studies:

In [None]:
sorted_studies = sorted(studies, key=lambda x: x.allSampleCount)
sorted_studies[-3:]

[CancerStudy(allSampleCount=24146, cancerType=None, cancerTypeId='mixed', citation='Kelly et al. Nat Genet 2020', cnaSampleCount=None, completeSampleCount=None, description='Clonal hematopoiesis mutations identified in blood samples from 24,146 patients whose tumor-blood pairs were analyzed using MSK-IMPACT.', groups='PUBLIC', importDate='2023-12-08 13:24:19', massSpectrometrySampleCount=None, methylationHm27SampleCount=None, miRnaSampleCount=None, mrnaMicroarraySampleCount=None, mrnaRnaSeqSampleCount=None, mrnaRnaSeqV2SampleCount=None, name='Cancer Therapy and Clonal Hematopoiesis (MSK, Nat Genet 2020)', pmid='33106634', publicStudy=True, readPermission=True, referenceGenome='hg19', rppaSampleCount=None, sequencedSampleCount=None, status=0, studyId='msk_ch_2020', treatmentCount=None),
 CancerStudy(allSampleCount=25775, cancerType=None, cancerTypeId='mixed', citation='Nguyen et al. Cell 2022', cnaSampleCount=None, completeSampleCount=None, description='MSK-MET (Memorial Sloan Kettering
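
Sorting the whole list works, but when only the largest few entries are needed `max` and `heapq.nlargest` express the intent more directly. The sketch below uses stand-in objects instead of real API results. The study ids are hypothetical, the counts mirror the output above, and whether `allSampleCount` can actually be missing in a response is an assumption that the helper simply guards against:

```python
import heapq
from types import SimpleNamespace

# Stand-in objects; real entries come from cbioportal.Studies.getAllStudiesUsingGET().result().
example_studies = [
    SimpleNamespace(studyId='study_a', allSampleCount=24146),
    SimpleNamespace(studyId='study_b', allSampleCount=42714),
    SimpleNamespace(studyId='study_c', allSampleCount=None),  # hypothetical missing count
    SimpleNamespace(studyId='study_d', allSampleCount=25775),
]

def sample_count(study):
    """Treat a missing sample count as zero so comparisons never see None."""
    return study.allSampleCount or 0

# The single largest study, and the two largest in descending order.
largest = max(example_studies, key=sample_count)
top_two = heapq.nlargest(2, example_studies, key=sample_count)

print(largest.studyId, [s.studyId for s in top_two])
```

Unlike slicing a sorted list with `[-3:]`, `heapq.nlargest` returns the entries from largest to smallest.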

Now that we've answered the initial questions we can dig a little deeper into a specific study. Let's use the MSK-IMPACT study (`msk_impact_2017`):

- How many patients are in this study?
- What gene is most commonly mutated across the different samples?
- Does this study span one or more types of cancer?

The description of the study with id `msk_impact_2017` mentions that roughly 10,000 patients were sequenced. Can we find this number in cBioPortal?

In [None]:
patients = cbioportal.Patients.getAllPatientsInStudyUsingGET(studyId='msk_impact_2017').result()
print("The msk_impact_2017 study spans {} patients".format(len(patients)))

Now let's try to figure out what gene is most commonly mutated. For this we can check the endpoints in the group `Mutations`. When looking at these endpoints it seems that a study can have multiple molecular profiles. This is because samples might have been sequenced using different assays (e.g. targeting a subset of genes or all genes). An example for the `acc_tcga` study is given for a molecular profile (`acc_tcga_mutations`) and a collection of samples (`acc_tcga_all`). We can use the same approach for the `msk_impact_2017` study. This will take a few seconds (you can use the command `%%time` to time a cell):

In [None]:
%%time

mutations = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
    molecularProfileId='msk_impact_2017_mutations',
    sampleListId='msk_impact_2017_all'
).result()

We can explore what the mutation data structure looks like:

In [None]:
mutations[0]

Mutation(alleleSpecificCopyNumber=None, aminoAcidChange=None, center='NA', chr='9', driverFilter=None, driverFilterAnnotation=None, driverTiersFilter=None, driverTiersFilterAnnotation=None, endPosition=133760514, entrezGeneId=25, gene=Gene(entrezGeneId=25, geneticEntityId=None, hugoGeneSymbol='ABL1', type='protein-coding'), keyword='ABL1 truncating', molecularProfileId='msk_impact_2017_mutations', mutationStatus='NA', mutationType='Frame_Shift_Del', namespaceColumns=None, ncbiBuild='GRCh37', normalAltCount=None, normalRefCount=None, patientId='P-0000507', proteinChange='K947Sfs*122', proteinPosEnd=947, proteinPosStart=946, referenceAllele='C', refseqMrnaId='NM_005157.4', sampleId='P-0000507-T01-IM3', startPosition=133760514, studyId='msk_impact_2017', tumorAltCount=80, tumorRefCount=759, uniquePatientKey='UC0wMDAwNTA3Om1za19pbXBhY3RfMjAxNw', uniqueSampleKey='UC0wMDAwNTA3LVQwMS1JTTM6bXNrX2ltcGFjdF8yMDE3', validationStatus='NA', variantAllele='-', variantType='DEL')

It seems that the `gene` field is not filled in. To keep the response size small, the API uses a parameter called `projection` that indicates whether to return all fields of an object or only a portion of them. By default it uses the `SUMMARY` projection. Because in this case we want the `gene` information, we'll use the `DETAILED` projection instead, so let's update the previous statement:

In [None]:
%%time 

mutations = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
    molecularProfileId='msk_impact_2017_mutations',
    sampleListId='msk_impact_2017_all',
    projection='DETAILED'
).result()

CPU times: user 1.99 s, sys: 84.4 ms, total: 2.07 s
Wall time: 6.94 s


You can see the response time is slightly slower. Let's check if the gene field is filled in now:

In [None]:
mutations[0]

Now that we have the gene field we can check what gene is most commonly mutated: 

In [None]:
from collections import Counter
mutation_counts = Counter([m.gene.hugoGeneSymbol for m in mutations])
mutation_counts.most_common(5)

We can verify that these results are correct by looking at the study view of the MSK-IMPACT study on the cBioPortal website: https://www.cbioportal.org/study/summary?id=msk_impact_2017. Note that the website uses the REST API we've been using in this hackathon, so we would expect those numbers to be the same, but good to do a sanity check. We see that the number of patients is indeed 10,336. But the number of samples with a mutation in TP53 is 4,561 instead of 4,985. Can you spot why they differ?

Next question:

- How many samples have a TP53 mutation?

For this exercise it might be useful to use a [pandas dataframe](https://pandas.pydata.org/) to be able to do grouping operations. You can convert the mutations result to a dataframe like this:

In [None]:
{k:getattr(mutations[0],k) for k in dir(mutations[0])}

{'alleleSpecificCopyNumber': None,
 'aminoAcidChange': None,
 'center': 'NA',
 'chr': '9',
 'driverFilter': None,
 'driverFilterAnnotation': None,
 'driverTiersFilter': None,
 'driverTiersFilterAnnotation': None,
 'endPosition': 133760514,
 'entrezGeneId': 25,
 'gene': Gene(entrezGeneId=25, geneticEntityId=None, hugoGeneSymbol='ABL1', type='protein-coding'),
 'keyword': 'ABL1 truncating',
 'molecularProfileId': 'msk_impact_2017_mutations',
 'mutationStatus': 'NA',
 'mutationType': 'Frame_Shift_Del',
 'namespaceColumns': None,
 'ncbiBuild': 'GRCh37',
 'normalAltCount': None,
 'normalRefCount': None,
 'patientId': 'P-0000507',
 'proteinChange': 'K947Sfs*122',
 'proteinPosEnd': 947,
 'proteinPosStart': 946,
 'referenceAllele': 'C',
 'refseqMrnaId': 'NM_005157.4',
 'sampleId': 'P-0000507-T01-IM3',
 'startPosition': 133760514,
 'studyId': 'msk_impact_2017',
 'tumorAltCount': 80,
 'tumorRefCount': 759,
 'uniquePatientKey': 'UC0wMDAwNTA3Om1za19pbXBhY3RfMjAxNw',
 'uniqueSampleKey': 'UC0wMDAwNT

In [None]:
import pandas as pd

mdf = pd.DataFrame.from_dict([
    # python magic that combines two dictionaries:
    dict(
        {k:getattr(m,k) for k in dir(m)},
        **{k:getattr(m.gene,k) for k in dir(m.gene)}) 
    # create one item in the list for each mutation
    for m in mutations
])
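
The "python magic" in the cell above is ordinary dictionary merging. As a small illustration on stand-in dictionaries (values copied from the first mutation shown earlier, not fetched from the API), `dict(a, **b)` and the unpacking form `{**a, **b}` give the same result, with values from the second dictionary winning on a key clash:

```python
# Stand-in dictionaries with values taken from the first mutation shown above.
mutation_fields = {
    'sampleId': 'P-0000507-T01-IM3',
    'entrezGeneId': 25,
    'proteinChange': 'K947Sfs*122',
}
gene_fields = {
    'entrezGeneId': 25,
    'hugoGeneSymbol': 'ABL1',
    'type': 'protein-coding',
}

# dict(a, **b) copies a and then adds the keys of b; on a clash b wins.
merged_with_call = dict(mutation_fields, **gene_fields)

# The same merge written with dictionary unpacking (Python 3.5+).
merged_with_unpacking = {**mutation_fields, **gene_fields}

print(merged_with_call)
```

Flattening the nested gene this way is what makes `hugoGeneSymbol` available as its own DataFrame column.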

In [None]:
mdf.columns

In [None]:
mdf['chr startPosition endPosition referenceAllele variantAllele hugoGeneSymbol proteinChange'.split()].head()

The DataFrame is a data type familiar from `R` (and similar to tables in `Matlab`) that makes it easier to work with columnar data. Pandas brings that data type to Python and gains several performance optimizations by building on the data types from [numpy](https://www.numpy.org/).

Now that you have the data in a DataFrame you can group the mutations by gene name and count the number of unique samples with a TP53 mutation:

In [None]:
sample_count_per_gene = mdf.groupby('hugoGeneSymbol')['uniqueSampleKey'].nunique()

print("There are {} samples with a mutation in TP53".format(
    sample_count_per_gene['TP53']
))
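
One likely reason the earlier `Counter` result is higher than the number on the website is that it counts mutations, while the study view counts samples, and a sample can carry more than one mutation in the same gene. The hypothetical rows below (the sample keys are made up) show the difference between `count` and `nunique`:

```python
import pandas as pd

# Hypothetical rows: sample S1 carries two TP53 mutations, S2 carries one.
example = pd.DataFrame({
    'hugoGeneSymbol': ['TP53', 'TP53', 'TP53', 'KRAS'],
    'uniqueSampleKey': ['S1', 'S1', 'S2', 'S2'],
})

# count() tallies every mutation row, nunique() tallies each sample once per gene.
mutations_per_gene = example.groupby('hugoGeneSymbol')['uniqueSampleKey'].count()
samples_per_gene = example.groupby('hugoGeneSymbol')['uniqueSampleKey'].nunique()

print(mutations_per_gene['TP53'], samples_per_gene['TP53'])
```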

It would be nice to visualize this result in context of the other genes by plotting the top 10 most mutated genes. For this you can use the matplotlib interface that integrates with pandas.

First enable inline plotting in the notebook:

In [None]:
%matplotlib inline

In [None]:
sample_count_per_gene.sort_values(ascending=False).head(10).plot(kind='bar')

Make it look a little nicer by importing seaborn:

In [None]:
import seaborn as sns
sns.set_style("white")
sns.set_context('notebook')

In [None]:
sample_count_per_gene.sort_values(ascending=False).head(10).plot(kind='bar')
sns.despine(trim=False)

You can further change the plot a bit by using the arguments to the plot function or using the matplotlib interface directly:

In [None]:
import matplotlib.pyplot as plt

sample_count_per_gene.sort_values(ascending=False).head(10).plot(
    kind='bar',
    ylim=[0,5000],
    color='green'
)
sns.despine(trim=False)
plt.xlabel('')
plt.xticks(rotation=300)
plt.ylabel('Number of samples',labelpad=20)
plt.title('Number of mutations in genes in MSK-IMPACT (2017)',pad=25)

A further extension of this plot could be to color the bar chart by the type of mutation in that sample (`mdf.mutationType`) and to include copy number alterations (see `Discrete Copy Number Alterations` endpoints).
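
As a starting point for that extension, the counts can be split by mutation type before plotting. This is only a sketch on hypothetical rows that reuse the column names of `mdf`; note that a sample with two different mutation types in one gene is counted once in each column, so the stacked total can exceed the number of unique samples:

```python
import pandas as pd

# Hypothetical rows using the column names of mdf.
example = pd.DataFrame({
    'hugoGeneSymbol': ['TP53', 'TP53', 'TP53', 'KRAS'],
    'mutationType': ['Missense_Mutation', 'Missense_Mutation',
                     'Nonsense_Mutation', 'Missense_Mutation'],
    'uniqueSampleKey': ['S1', 'S2', 'S3', 'S1'],
})

# One row per gene, one column per mutation type, values = number of unique samples.
type_counts = (
    example.groupby(['hugoGeneSymbol', 'mutationType'])['uniqueSampleKey']
    .nunique()
    .unstack(fill_value=0)
)

# Order genes by their total and keep the top 10, as in the plots above.
top_genes = type_counts.sum(axis=1).sort_values(ascending=False).head(10).index
stacked_input = type_counts.loc[top_genes]
print(stacked_input)

# stacked_input.plot(kind='bar', stacked=True) would then draw the stacked chart.
```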

#### Gene Frequency for specific Cancer Type
First get the cancer type for each sample via the clinical data endpoint:

In [None]:
clin_data = cbioportal.Clinical_Data.getAllClinicalDataInStudyUsingGET(studyId='msk_impact_2017').result()

Convert into dataframe:

In [None]:

cdf = pd.DataFrame.from_dict(
    {k:getattr(cd,k) for k in dir(cd)} for cd in clin_data
).pivot(index='sampleId patientId'.split(),columns='clinicalAttributeId',values='value')
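
The clinical data endpoint returns one row per sample and attribute (a "long" table), and `pivot` turns it into one row per sample with one column per attribute. A minimal sketch on hypothetical rows (sample ids, patient ids and values are made up for illustration) shows the reshaping:

```python
import pandas as pd

# Hypothetical long-format clinical rows: one row per sample and attribute.
long_df = pd.DataFrame({
    'sampleId': ['S1', 'S1', 'S2', 'S2'],
    'patientId': ['P1', 'P1', 'P2', 'P2'],
    'clinicalAttributeId': ['CANCER_TYPE', 'SAMPLE_TYPE', 'CANCER_TYPE', 'SAMPLE_TYPE'],
    'value': ['Colorectal Cancer', 'Primary', 'Glioma', 'Metastasis'],
})

# Wide format: (sampleId, patientId) index, one column per clinical attribute.
wide_df = long_df.pivot(
    index=['sampleId', 'patientId'],
    columns='clinicalAttributeId',
    values='value',
)
print(wide_df)
```

Passing a list as `index` needs a reasonably recent pandas (1.1 or later). `pivot` also raises an error if the same sample has two values for one attribute, which is a useful check on the input.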

Merge together with the mutations MAF:

In [None]:
clin_mut_df = mdf.merge(cdf, on='sampleId', how='left')

Then plot the top 30 most mutated genes for colorectal cancer:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Filter clin_mut_df for Colorectal Cancer
colorectal_df = clin_mut_df[clin_mut_df['CANCER_TYPE'] == 'Colorectal Cancer']

# Group by hugoGeneSymbol and count unique patientId
gene_counts = colorectal_df.groupby('hugoGeneSymbol')['patientId'].nunique()

# Sort in descending order and take top 30
sorted_counts = gene_counts.sort_values(ascending=False).head(30)

# Create bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=sorted_counts.index, y=sorted_counts.values)
plt.xlabel('hugoGeneSymbol')
plt.ylabel('Number of Patients')
plt.title('Top 30 hugoGeneSymbols in Colorectal Cancer')
plt.xticks(rotation=45)
sns.despine()
plt.show()

You can do the same by frequency as well. For MSK-IMPACT this is slightly tricky because not all samples were profiled for the same genes. For convenience we will ignore this fact and assume that all samples were profiled for all genes. This is not true, but it will give you a rough idea of the frequency of mutations in the MSK-IMPACT study. First get the total number of patients per cancer type:

In [None]:
cancer_type_num_patients = cdf.reset_index().groupby("CANCER_TYPE")["patientId"].nunique()

Now plot the top 30 most mutated genes for colorectal cancer by frequency in the cohort:

In [None]:
# Filter clin_mut_df for Colorectal Cancer
cancer_type = 'Colorectal Cancer'
colorectal_df = clin_mut_df[clin_mut_df['CANCER_TYPE'] == cancer_type]
nr_patients_for_cancer_type = cancer_type_num_patients[cancer_type]

# Group by hugoGeneSymbol and count unique patientId
gene_counts = colorectal_df.groupby('hugoGeneSymbol')['patientId'].nunique() * 100.0 / nr_patients_for_cancer_type

# Sort in descending order and take top 30
sorted_counts = gene_counts.sort_values(ascending=False).head(30)

# Create bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=sorted_counts.index, y=sorted_counts.values)
plt.xlabel('hugoGeneSymbol')
plt.ylabel('Percentage of Patients')
plt.title('Top 30 hugoGeneSymbols in {} (n={})'.format(cancer_type, nr_patients_for_cancer_type))
plt.xticks(rotation=45)
sns.despine()
plt.show()

#### Alteration frequency across cancer types per gene
We previously only worked with mutation data. We will need to get the copy number as well. SV data is not available yet in the API. First get the copy number data for the MSK-IMPACT study:

In [None]:
cna_data = cbioportal.Discrete_Copy_Number_Alterations.getDiscreteCopyNumbersInMolecularProfileUsingGET(
    molecularProfileId='msk_impact_2017_cna',
    sampleListId='msk_impact_2017_all',
    projection='DETAILED'
).result()
cna_df = pd.DataFrame.from_dict([
    # python magic that combines two dictionaries:
    dict(
        {k:getattr(cna,k) for k in dir(cna)},
        **{k:getattr(cna.gene,k) for k in dir(cna.gene)}) 
    # create one item in the list for each copy number alteration
    for cna in cna_data
])
del cna_df['gene']
cna_df['hugoGeneSymbol alteration'.split()].head(5)

Create an alterations dataframe that combines mutations, CNA and clinical data:

In [None]:
# First, we need to add a column to each DataFrame that indicates the type of alteration
mdf['alteration'] = 'Mutation'
cna_df['alteration'] = cna_df['alteration'].map({-2: 'Deep Deletion', 2: 'Amplification'})

# Concatenate the two DataFrames
alterations_df = pd.concat([mdf, cna_df])

# Group by sample, patient and gene and collect the alterations of each group in a list
alterations_df = alterations_df.groupby(['sampleId','patientId', 'hugoGeneSymbol'])['alteration'].apply(list).reset_index()

# Create a new column that indicates whether there are multiple alterations
alterations_df['alteration'] = alterations_df['alteration'].apply(lambda x: 'Multiple Alterations' if len(x) > 1 else x[0])

alterations_df = alterations_df.merge(cdf, on='sampleId', how='left')

print(alterations_df.head())
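
The cell above labels a sample/gene pair as `Multiple Alterations` whenever it has more than one event, so two mutations in the same gene also end up in that category. A hedged variant, sketched here on hypothetical rows, only uses the label when the kinds of alteration differ. Whether this matches the definition used on the cBioPortal website is an assumption, and `summarize` is an illustrative helper name:

```python
import pandas as pd

# Hypothetical events: S1 has a mutation and an amplification in ERBB2,
# S2 has two mutations in ERBB2.
events = pd.DataFrame({
    'sampleId': ['S1', 'S1', 'S2', 'S2'],
    'hugoGeneSymbol': ['ERBB2', 'ERBB2', 'ERBB2', 'ERBB2'],
    'alteration': ['Mutation', 'Amplification', 'Mutation', 'Mutation'],
})

def summarize(alterations):
    """Collapse all events of one sample/gene pair into a single label."""
    kinds = set(alterations)
    return 'Multiple Alterations' if len(kinds) > 1 else kinds.pop()

labels = events.groupby(['sampleId', 'hugoGeneSymbol'])['alteration'].agg(summarize)
print(labels)
```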

Then make the stacked barchart:

In [None]:
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns

gene = 'ERBB2'
colors = {
    'Mutation': '#008000',  # rgb(0, 128, 0)
    'Amplification': '#FF0000',  # rgb(255, 0, 0)
    'Multiple Alterations': '#666666',  # rgb(102, 102, 102)
    'Deep Deletion': '#0000FF'  # rgb(0, 0, 255)
}
order = ['Multiple Alterations', 'Deep Deletion', 'Amplification', 'Mutation']

# Count the number of patients for each cancer type and alteration
counts_df = alterations_df[alterations_df['hugoGeneSymbol'] == gene].groupby(['CANCER_TYPE', 'alteration'])['patientId'].nunique().unstack().fillna(0)

# Ensure that the index of cancer_type_num_patients matches the index of counts_df
cancer_type_num_patients = cancer_type_num_patients.reindex(counts_df.index)

# Divide counts_df by cancer_type_num_patients to get alteration frequencies
alteration_frequencies = counts_df.div(cancer_type_num_patients, axis='index') * 100.0

# Sort by total frequency and select the top 20 cancer types
alteration_frequencies_sorted = alteration_frequencies.loc[alteration_frequencies.sum(axis=1).sort_values(ascending=False).index[:20]]
# Use reindex so that an alteration type that never occurs for this gene becomes a column of zeros instead of a KeyError
alteration_frequencies_sorted = alteration_frequencies_sorted.reindex(columns=order, fill_value=0)

# Plot
ax = alteration_frequencies_sorted.plot(kind='bar', stacked=True, color=[colors[col] for col in alteration_frequencies_sorted.columns])
plt.ylabel('Percentage of Patients')
plt.xlabel('')
plt.title('{} Alteration Frequency per Cancer Type'.format(gene))

# Reorder the legend
handles, labels = ax.get_legend_handles_labels()
order_index = [labels.index(alteration) for alteration in reversed(order)]
ax.legend([handles[idx] for idx in order_index],[labels[idx] for idx in order_index])

# Despine
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Add percentage sign to y ticks
formatter = FuncFormatter(lambda y, pos: "%d%%" % (y))
ax.yaxis.set_major_formatter(formatter)

# Add horizontal grid lines
ax.yaxis.grid(True, linestyle='--', linewidth=0.5, color='lightgray')

plt.show()

### Genome Nexus API

[Genome Nexus](https://www.genomenexus.org) is a web service that aggregates all cancer-related information about a particular mutation. Similarly to cBioPortal it provides a REST API following the [Swagger / OpenAPI specification](https://swagger.io/specification/).

In [None]:
from bravado.client import SwaggerClient

gn = SwaggerClient.from_url('https://www.genomenexus.org/v2/api-docs',
                            config={"validate_requests":False,
                                    "validate_responses":False,
                                    "validate_swagger_spec":False})
print(gn)

To look up annotations for a single variant, one can use the following endpoint:

In [None]:
variant = gn.annotation_controller.fetchVariantAnnotationByGenomicLocationGET(
    genomicLocation='7,140453136,140453136,A,T',
    # adds extra annotation resources, not included in default response:
    fields='hotspots mutation_assessor annotation_summary'.split()
).result()
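
The `genomicLocation` argument is a comma-separated string of chromosome, start, end, reference allele and variant allele. A small helper (the name `genomic_location_key` is made up for this sketch) keeps that formatting in one place, which is handy when building such strings from DataFrame rows:

```python
def genomic_location_key(chromosome, start, end, reference_allele, variant_allele):
    """Build the 'chromosome,start,end,ref,alt' string for one genomic location."""
    return '{},{},{},{},{}'.format(
        chromosome, start, end, reference_allele, variant_allele
    )

# The example variant queried above.
key = genomic_location_key('7', 140453136, 140453136, 'A', 'T')
print(key)
```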

You can see a lot of information is provided for that particular variant if you press tab after typing `variant.`:

In [None]:
variant.

For this example we will focus on the hotspot annotation and ignore the others. [Cancer hotspots](https://www.cancerhotspots.org/) is a popular web resource which indicates whether particular variants have been found to be recurrently mutated in large scale cancer genomics data.

The example variant above is a hotspot:

In [None]:
variant.hotspots

Let's see how many hotspot mutations there are in the Cholangiocarcinoma (TCGA, PanCancer Atlas) study with study id `chol_tcga_pan_can_atlas_2018` from the cBioPortal:

In [None]:
%%time

cbioportal = SwaggerClient.from_url('https://www.cbioportal.org/api/api-docs',
                                config={"validate_requests":False,"validate_responses":False})

mutations = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
    molecularProfileId='chol_tcga_pan_can_atlas_2018_mutations',
    sampleListId='chol_tcga_pan_can_atlas_2018_all',
    projection='DETAILED'
).result()

Convert the results to a dataframe again:

In [None]:
import pandas as pd

mdf = pd.DataFrame.from_dict([
    # python magic that combines two dictionaries:
    dict(
        {k:getattr(m,k) for k in dir(m)},
        **{k:getattr(m.gene,k) for k in dir(m.gene)}) 
    # create one item in the list for each mutation
    for m in mutations
])
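
The dictionary merge used above can be illustrated on plain dictionaries. This is a minimal sketch with made-up keys and values; it only shows how `dict(a, **b)` combines two dictionaries, with keys from the second one winning on a collision.

```python
# Made-up stand-ins for the mutation-level and gene-level fields
mutation_fields = {"sampleId": "SAMPLE_1", "proteinChange": "X1Y"}
gene_fields = {"hugoGeneSymbol": "GENE_A", "type": "protein-coding"}

# Same pattern as in the cell above
merged = dict(mutation_fields, **gene_fields)

# Equivalent spelling available since Python 3.5
merged_alt = {**mutation_fields, **gene_fields}

print(merged)
```

Both spellings produce one flat dictionary per mutation, which is what lets pandas turn the list into one row per mutation.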

Then keep only the unique mutations, to avoid querying the web service for the same variant more than once:

In [None]:
variants = mdf['chr startPosition endPosition referenceAllele variantAllele'.split()]\
    .drop_duplicates()\
    .dropna(how='any',axis=0)\
    .reset_index(drop=True)
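
To see what the chained calls above do, here is a small sketch on a toy dataframe. The rows are invented for illustration and only mimic the five columns selected from `mdf`.

```python
import pandas as pd

# Toy stand-in for the selected columns of `mdf`; values are made up
toy = pd.DataFrame({
    "chr": ["7", "7", "12", None],
    "startPosition": [140453136, 140453136, 25398284, 100],
    "endPosition": [140453136, 140453136, 25398284, 100],
    "referenceAllele": ["A", "A", "C", "G"],
    "variantAllele": ["T", "T", "T", "A"],
})

unique_variants = (
    toy.drop_duplicates()          # the repeated chromosome 7 row collapses to one
       .dropna(how="any", axis=0)  # the row with a missing chromosome is dropped
       .reset_index(drop=True)     # renumber the remaining rows from 0
)
print(len(toy), "->", len(unique_variants))
```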

Inspect the result, then convert it to input that Genome Nexus will understand:

In [None]:
variants

In [None]:
variants = variants.rename(columns={'chr':'chromosome','startPosition':'start','endPosition':'end'})\
    .to_dict(orient='records')

In [None]:
print("There are {} mutations left to annotate".format(len(variants)))

Annotate them with Genome Nexus:

In [None]:
%%time 

variants_annotated = gn.annotation_controller.fetchVariantAnnotationByGenomicLocationPOST(
    genomicLocations=variants,
    fields='hotspots annotation_summary'.split()
).result()
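
The call above sends all variants in a single request. For a much larger study, one option is to split the list into batches and annotate each batch separately. The sketch below shows only the splitting step; the helper name `chunked` and the batch size are hypothetical, and whether the server enforces any request size limit is not something this notebook establishes.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Toy records in the same shape as `variants`; values are made up
toy_variants = [
    {"chromosome": "1", "start": pos, "end": pos,
     "referenceAllele": "A", "variantAllele": "T"}
    for pos in range(100, 105)
]

batches = list(chunked(toy_variants, 2))
print([len(b) for b in batches])
```

Each batch could then be passed as `genomicLocations=batch` to the same POST endpoint, and the per-batch results concatenated into one list.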

Index the variants to make it easier to query them:

In [None]:
gn_dict = {
    # key: "chromosome,start,end,referenceAllele,variantAllele"
    "{},{},{},{},{}".format(
        v.annotation_summary.genomicLocation.chromosome,
        v.annotation_summary.genomicLocation.start,
        v.annotation_summary.genomicLocation.end,
        v.annotation_summary.genomicLocation.referenceAllele,
        v.annotation_summary.genomicLocation.variantAllele
    ): v
    for v in variants_annotated
}

Add a new column to indicate whether a mutation is a hotspot:

In [None]:
def is_hotspot(x):
    """TODO: Current structure for hotspots in Genome Nexus is a little funky.
    Need to check whether all lists in the annotation field are empty."""
    if x:
        return sum([len(a) for a in x.hotspots.annotation]) > 0
    else:
        return False

def create_dict_query_key(x):
    return "{},{},{},{},{}".format(
        x.chr, x.startPosition, x.endPosition, x.referenceAllele, x.variantAllele
    )

In [None]:
mdf['is_hotspot'] = mdf.apply(lambda x: is_hotspot(gn_dict.get(create_dict_query_key(x), None)), axis=1)
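
The behaviour of `is_hotspot` can be checked without calling the web service. The sketch below repeats the function so it is self-contained and feeds it mock objects; the nested-list shape of `hotspots.annotation` is an assumption taken from the docstring above, and the mock entries are placeholders rather than real Genome Nexus output.

```python
from types import SimpleNamespace

def is_hotspot(x):
    # Same logic as the cell above: any non-empty inner list counts as a hit
    if x:
        return sum(len(a) for a in x.hotspots.annotation) > 0
    return False

# Mock annotations (placeholder content, not real API responses)
with_hit = SimpleNamespace(hotspots=SimpleNamespace(annotation=[[{"mock": "hotspot entry"}]]))
no_hit = SimpleNamespace(hotspots=SimpleNamespace(annotation=[[], []]))

print(is_hotspot(with_hit))
print(is_hotspot(no_hit))
print(is_hotspot(None))  # variants missing from gn_dict are looked up as None
```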

Then plot the results:

In [None]:
%matplotlib inline
import seaborn as sns
sns.set_style("white")
sns.set_context('notebook')
import matplotlib.pyplot as plt

In [None]:
mdf.groupby('hugoGeneSymbol').is_hotspot.sum().sort_values(ascending=False).head(10).plot(kind='bar')

sns.despine(trim=False)
plt.xlabel('')
plt.xticks(rotation=300)
plt.ylabel('Number of non-unique hotspots',labelpad=20)
plt.title('Hotspots in Cholangiocarcinoma (TCGA, PanCancer Atlas)',pad=25)
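
The y-axis counts non-unique hotspots: a hotspot seen in several samples is counted once per sample. The sketch below contrasts that with counting each distinct variant once per gene. The toy rows and gene names are made up, and using a `proteinChange` column to identify a variant is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for `mdf`; values are made up
toy = pd.DataFrame({
    "hugoGeneSymbol": ["GENE_A", "GENE_A", "GENE_A", "GENE_B"],
    "proteinChange": ["X1Y", "X1Y", "Z2W", "Q3R"],
    "is_hotspot": [True, True, False, True],
})

# Every hotspot mutation counts, even when several samples share it (as in the plot above)
non_unique = toy.groupby("hugoGeneSymbol").is_hotspot.sum()

# Alternative: count each distinct hotspot variant once per gene
unique = toy[toy.is_hotspot].groupby("hugoGeneSymbol").proteinChange.nunique()

print(non_unique.to_dict())
print(unique.to_dict())
```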

### OncoKB API

[OncoKB](https://oncokb.org) is a precision oncology knowledge base that contains information about the effects and treatment implications of specific cancer gene alterations. Similarly to cBioPortal and Genome Nexus, it provides a REST API following the [Swagger / OpenAPI specification](https://swagger.io/specification/).

In [None]:
oncokb = SwaggerClient.from_url('https://www.oncokb.org/api/v1/v2/api-docs',
                            config={"validate_requests":False,
                                    "validate_responses":False,
                                    "validate_swagger_spec":False})
print(oncokb)

To look up annotations for a variant, one can use the following endpoint:

In [None]:
variant = oncokb.Annotations.annotateMutationsByGenomicChangeGetUsingGET(
    genomicLocation='7,140453136,140453136,A,T',
).result()

Again, a lot of information is returned for that particular variant. To list the available fields, type `variant.` and press tab:

In [None]:
variant.

For instance, we can see the summary information about it:

In [None]:
variant.variantSummary

If you look up this variant on the OncoKB website (https://www.oncokb.org/gene/BRAF/V600E), you can see that various drugs and drug combinations are listed together with their level of evidence. The levels are a classification system that indicates how much we know about whether or not a patient might respond to a particular treatment. Please see https://www.oncokb.org/levels for more information about the levels of evidence for therapeutic biomarkers.

We can use the same `variants` we pulled from cBioPortal in the previous section to find the highest level of evidence for each variant.

In [None]:
%%time 

variants_annotated = oncokb.Annotations.annotateMutationsByGenomicChangePostUsingPOST(
    body=[
        {"genomicLocation":"{chromosome},{start},{end},{referenceAllele},{variantAllele}".format(**v)} 
        for v in variants
    ],
).result()
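
The `format(**v)` call works because each record in `variants` carries the keys `chromosome`, `start`, `end`, `referenceAllele` and `variantAllele` after the rename step in the previous section. A minimal sketch of the request body this produces, using a single hand-written record for the variant queried earlier:

```python
# One record in the same shape as the entries of `variants`
toy_variants = [
    {"chromosome": "7", "start": 140453136, "end": 140453136,
     "referenceAllele": "A", "variantAllele": "T"},
]

# Named placeholders are filled from the dictionary keys
body = [
    {"genomicLocation": "{chromosome},{start},{end},{referenceAllele},{variantAllele}".format(**v)}
    for v in toy_variants
]
print(body)
```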

Count how often each highest level occurs across the variants:

In [None]:
from collections import Counter

counts_per_level = Counter([va.highestSensitiveLevel for va in variants_annotated if va.highestSensitiveLevel])
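
The `if va.highestSensitiveLevel` filter skips variants without a sensitive level, so they do not show up as a `None` category. A small sketch with mock objects; the level labels used here are assumptions for illustration and not taken from an actual response.

```python
from collections import Counter
from types import SimpleNamespace

# Mock annotations; labels are illustrative only
toy_annotated = [
    SimpleNamespace(highestSensitiveLevel="LEVEL_1"),
    SimpleNamespace(highestSensitiveLevel=None),  # no sensitive level: skipped
    SimpleNamespace(highestSensitiveLevel="LEVEL_4"),
    SimpleNamespace(highestSensitiveLevel="LEVEL_1"),
]

counts = Counter(
    va.highestSensitiveLevel for va in toy_annotated if va.highestSensitiveLevel
)
print(counts.most_common())
```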

Then plot them:

In [None]:
# one bar per level; pandas expects the keyword `color` (not `colors`)
pd.DataFrame(counts_per_level,index=[0]).plot(kind='bar', color=['#4D8834','#2E2E2C','#753579'])
plt.xticks([])
plt.ylabel('Number of variants')
plt.title('Actionable variants in chol_tcga_pan_can_atlas_2018')
sns.despine()