# GENIE and TCGA BigQuery AI Notebook Example

The following notebook example shows how to write an advanced query using the `%%bigquery` magic command, which lets you run queries with minimal code and visualize the results. It calculates the frequency of SNV mutations for the top 50 mutated genes in GENIE and compares them to the same genes in TCGA.

## Query TCGA Data
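The first query runs against the `isb-cgc.TCGA_hg38_data_v0.Somatic_Mutation` BigQuery table provided by ISB-CGC. It takes a list of gene symbols and returns, for each gene, the fraction of TCGA lung adenocarcinoma (LUAD) tumor samples (barcodes) that have a SNP called in that gene. The result is stored in the `tcga` DataFrame.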

In [None]:
%%bigquery tcga
WITH genes AS (
  SELECT * FROM UNNEST([
      'ABL1',   'AKT1',    'ALK',    'APC',
       'BRAF',   'CDH1',    'CDKN2A', 'CSF1R',
       'CTNNB1', 'EGFR',    'ERBB2',  'ERBB3',
       'ESR1',   'FBXW7',   'FGFR1',  'FGFR2',
       'FGFR3',  'FLT3',    'GNA11',  'GNAQ',
       'GNAS',   'HRAS',    'IDH1',   'IDH2',
       'JAK2',   'JAK3',    'KIT',    'KLLN',
       'KRAS',   'MAP2K1',  'MET',    'MLH1',
       'MPL',    'MYC',     'NOTCH1', 'NRAS',
       'PDGFRA', 'PIK3CA',  'PIK3R1', 'PTEN',
       'PTPN11', 'RB1',     'RET',    'RUNX1',
       'SMAD4',  'SMARCB1', 'SRC',    'STK11',
       'TP53',   'VHL',     'WRAP53'
   ]) AS symbol
), luad AS (
   SELECT COUNT(DISTINCT sample_barcode_tumor) AS unique_samples
     FROM `isb-cgc.TCGA_hg38_data_v0.Somatic_Mutation` tcga_mut
    WHERE tcga_mut.sample_barcode_tumor IN (SELECT samplebarcode FROM `isb-cgc.tcga_cohorts.LUAD`)
)
  SELECT genes.symbol Hugo_Symbol,  
         COUNT(DISTINCT sample_barcode_tumor)/(SELECT unique_samples FROM luad) mut_freq
    FROM genes, `isb-cgc.TCGA_hg38_data_v0.Somatic_Mutation` tcga_mut
   WHERE genes.symbol = tcga_mut.symbol
     AND tcga_mut.Variant_Type = 'SNP'
     AND tcga_mut.sample_barcode_tumor IN (SELECT samplebarcode FROM `isb-cgc.tcga_cohorts.LUAD`)
GROUP BY genes.symbol
ORDER BY mut_freq DESC

In [None]:
tcga
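
The frequency returned by the query above is the number of distinct tumor samples with a SNP in a gene, divided by the number of distinct tumor samples in the cohort. A minimal pure-Python sketch of that calculation on made-up records may make the logic easier to follow. The `mutations` list, the sample IDs, and the `mutation_frequency` helper are hypothetical and are not part of the notebook or of any ISB-CGC API:

```python
from collections import defaultdict

# Hypothetical toy records: (sample barcode, gene symbol, variant type).
mutations = [
    ("S1", "TP53", "SNP"),
    ("S1", "KRAS", "SNP"),
    ("S2", "TP53", "SNP"),
    ("S2", "TP53", "SNP"),  # second TP53 SNP in the same sample, counted once
    ("S3", "EGFR", "DEL"),  # not a SNP, so it never reaches the numerator
    ("S4", "KRAS", "SNP"),
]
genes = {"TP53", "KRAS", "EGFR"}


def mutation_frequency(records, gene_symbols):
    """Return {gene: fraction of distinct samples with a SNP in that gene}."""
    # Denominator: every distinct sample that has any mutation record.
    total = len({sample for sample, _, _ in records})
    # Numerator: distinct samples per gene, restricted to SNPs in the gene list.
    samples_by_gene = defaultdict(set)
    for sample, symbol, variant_type in records:
        if symbol in gene_symbols and variant_type == "SNP":
            samples_by_gene[symbol].add(sample)
    freqs = {gene: len(s) / total for gene, s in samples_by_gene.items()}
    # Highest frequency first, mirroring ORDER BY mut_freq DESC.
    return dict(sorted(freqs.items(), key=lambda kv: kv[1], reverse=True))


mut_freq = mutation_frequency(mutations, genes)
print(mut_freq)
```

As in the SQL, a gene with no qualifying SNP (EGFR in this toy data) does not appear in the result, and a sample with several SNPs in the same gene is counted only once.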

## Genie Query
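The second query runs against `project-genie-query-prod.consortium.mutation`. It takes the same list of gene symbols and returns, for each gene, the fraction of GENIE lung adenocarcinoma patients (patient IDs) that have a SNP called in that gene. The result is stored in the `genie` DataFrame.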

Code from ISB-CGC Community Notebooks (D. Gibbs) and PHS (J. Slagel) 


In [None]:
%%bigquery genie
WITH genes AS (
  SELECT * FROM UNNEST([
       'ABL1',   'AKT1',    'ALK',    'APC',
       'BRAF',   'CDH1',    'CDKN2A', 'CSF1R',
       'CTNNB1', 'EGFR',    'ERBB2',  'ERBB3',
       'ESR1',   'FBXW7',   'FGFR1',  'FGFR2',
       'FGFR3',  'FLT3',    'GNA11',  'GNAQ',
       'GNAS',   'HRAS',    'IDH1',   'IDH2',
       'JAK2',   'JAK3',    'KIT',    'KLLN',
       'KRAS',   'MAP2K1',  'MET',    'MLH1',
       'MPL',    'MYC',     'NOTCH1', 'NRAS',
       'PDGFRA', 'PIK3CA',  'PIK3R1', 'PTEN',
       'PTPN11', 'RB1',     'RET',    'RUNX1',
       'SMAD4',  'SMARCB1', 'SRC',    'STK11',
       'TP53',   'VHL',     'WRAP53'
   ]) AS symbol
), luad AS (
  SELECT DISTINCT patient_id, sample_id
    FROM `project-genie-query-prod.consortium.sample` 
   WHERE cancer_type_detailed = 'Lung Adenocarcinoma'
), patient AS (
  SELECT COUNT(DISTINCT patient_id) total
    FROM `project-genie-query-prod.consortium.mutation` m, luad
   WHERE m.Tumor_Sample_Barcode = luad.sample_id
)
  SELECT m.Hugo_Symbol, 
         COUNT(DISTINCT luad.patient_id)/(SELECT total FROM patient) mut_freq
    FROM `project-genie-query-prod.consortium.mutation` m, luad
   WHERE m.Hugo_Symbol IN (SELECT symbol FROM genes)
     AND m.Variant_Type = 'SNP'
     AND m.Tumor_Sample_Barcode = luad.sample_id
GROUP BY m.Hugo_Symbol
ORDER BY mut_freq DESC

In [None]:
genie

## Merge Tables
This cell merges the two result tables using pandas. A more efficient approach would be to query both data sources in a single SQL query. However, this example shows how you can use both BigQuery and pandas within a single notebook.

In [None]:
import pandas as pd
results = pd.merge(tcga, genie, on='Hugo_Symbol')
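
Both query results name their frequency column `mut_freq`, so `pd.merge` disambiguates them with its default suffixes: `mut_freq_x` for the left table (TCGA) and `mut_freq_y` for the right table (GENIE). Its default inner join also keeps only genes present in both tables. The sketch below shows this on made-up frames, along with the optional `suffixes` argument for more descriptive names. The `tcga_demo` and `genie_demo` frames and their values are hypothetical:

```python
import pandas as pd

# Hypothetical toy frames standing in for the `tcga` and `genie` query results.
tcga_demo = pd.DataFrame(
    {"Hugo_Symbol": ["TP53", "KRAS", "VHL"], "mut_freq": [0.5, 0.25, 0.01]}
)
genie_demo = pd.DataFrame(
    {"Hugo_Symbol": ["TP53", "KRAS", "EGFR"], "mut_freq": [0.4, 0.3, 0.2]}
)

# Default behaviour: inner join on Hugo_Symbol, clashing columns get _x / _y.
default_merge = pd.merge(tcga_demo, genie_demo, on="Hugo_Symbol")
print(list(default_merge.columns))

# Explicit suffixes make the origin of each column visible.
named_merge = pd.merge(
    tcga_demo, genie_demo, on="Hugo_Symbol", suffixes=("_tcga", "_genie")
)
print(list(named_merge.columns))
```

The plotting cell below relies on the default `_x`/`_y` names, so switching to explicit suffixes would mean updating the column names there as well.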

## Plot
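Visualize the results as a scatter plot of the TCGA frequency (x-axis) against the GENIE frequency (y-axis) for each gene. The diagonal line marks equal frequency in both datasets.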

In [None]:
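# mut_freq_x holds the TCGA values and mut_freq_y the GENIE values (default pandas merge suffixes).
p = results.plot.scatter(x='mut_freq_x', y='mut_freq_y', grid=True)
p.set_title('GENIE vs TCGA % Mutated \n 50 Top GENIE Assay Covered Genes')
p.set_xlabel('TCGA % Mutated')
p.set_ylabel('GENIE % Mutated')
# Label only genes whose TCGA frequency exceeds 0.09, nudging the text to the right of the point.
for i, txt in enumerate(results.Hugo_Symbol):
    if results.mut_freq_x.iat[i] > 0.09:
        p.annotate(txt, (results.mut_freq_x.iat[i] + 0.005, results.mut_freq_y.iat[i]))
# Draw the y = x diagonal as an arrowless annotation line from (0.43, 0.43) to (0, 0).
p.annotate("",
           xy=(0, 0), xycoords='data',
           xytext=(0.43, 0.43), textcoords='data',
           arrowprops=dict(arrowstyle="-",
                           connectionstyle="arc3,rad=0."),
           )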